Are there any instances where a negative time type could give unexpected results? - kdb

Are there any instances where a negative time type could give unexpected results when used for specific purposes? For example, when time deltas are calculated between negative and non-negative time values, there do not appear to be any issues.
time          val
00:00:31.384  -0.3170017
00:06:00.139  0.9033492
00:07:01.099  -0.7661049
Then, for the purpose of a window join later over a 10-minute window:
win:00:10:00;
winForJoin: (neg win;00:00:00) +\:(exec time from data);
first[winForJoin] gives -00:09:28.616 -00:03:59.861 -00:02:58.901
winForJoin[1]-winForJoin[0] gives 10 minutes as expected

If I understand correctly, you're asking how a window join would behave if the opening interval were a negative time (due to the interval subtraction taking the values into negative territory, relative to 00:00).
The simple answer is that it won't behave any differently than if the times were plain numbers, but in practice you may see results you don't expect, depending on how your table is set up and what you're trying to achieve.
Taking the example in the official reference as a starting point: https://code.kx.com/q/ref/wj/
q)t:([]sym:3#`ibm;time:10:01:01 10:01:04 10:01:08;price:100 101 105)
q)a:101 103 103 104 104 107 108 107 108
q)b:98 99 102 103 103 104 106 106 107
q)q:([]sym:`ibm; time:10:01:01+til 9; ask:a; bid:b)
q)f:`sym`time
q)w:-2 1+\:t.time
/add volume too so it's easier to follow:
q)s:908 360 522 257 858 585 90 683 90;
q)update size:s from `q
/add an alternative range which has negative starting time
q)w2:(-11:00;1)+\:t.time
The window join takes all rows in the q table whose times fall between each pair of window times:
q)q[`time]within/:flip w
110000000b
011110000b
000001111b
Under the covers it's asking: are these positive numbers (the quote times) between those two positive numbers (the window range)? There's no reason it can't also ask: are these positive numbers between this negative number and this positive number?
q)q[`time]within/:flip w2
110000000b
111110000b
111111111b
You'll notice that all of them are greater than the negative time - meaning that it will include all rows from the beginning of the q table, up until the end time of that pair. This can be considered expected behaviour - if your start time is negative you must mean "from the beginning of time" - aka, all rows from the beginning of the table.
Comparing sum of size shows how the results differ:
q)wj[w;f;t;(q;(sum;`size))]
sym time price size
-----------------------
ibm 10:01:01 100 1268
ibm 10:01:04 101 1997
ibm 10:01:08 105 1448
q)wj[w2;f;t;(q;(sum;`size))]
sym time price size
-----------------------
ibm 10:01:01 100 1268
ibm 10:01:04 101 2905
ibm 10:01:08 105 4353
Finally, where it might get complicated: it depends on what "negative" time means in your table. If you're at 00:00 (midnight) and you subtract 10 minutes, are you trying to access data from 23:50 the day before? Or does 00:00 represent the starting time (row zero) of your table? If you're trying to access 23:50 from the day before, then you will have problems, because 23:50 is NOT between your negative start time and your positive end time, e.g.:
q)23:50 within(-00:58:59;10:01:02)
0b
Again, this all depends on how your data looks and what you're trying to do.
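As a rough illustration of those two interpretations (a sketch, not from the original post; the date and values are made up), the window starts can be clamped to midnight with | (max) if "beginning of the table" is what's meant, whereas carrying the date in a timestamp lets the subtraction roll back into the previous day:
q)w3:w2
q)w3[0]:00:00:00|w2[0]    / clamp negative window starts to the start of the day
q)w3[0]
00:00:00 00:00:00 00:00:00
q)2021.01.02D00:05:00 - 0D00:10:00    / with timestamps, 10 minutes before 00:05 is 23:55 the day before
2021.01.01D23:55:00.000000000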

Related

Grouping data into buckets by frequency in Postgres 11.6

Using Postgres 11.6, I'm trying to analyze some event data. The goal is to find the durations for all events with a specific name, and then split each one out into evenly sized buckets. We're looking for any times that "clump" for a specific event. I'm editing my question as the specific case may be obscuring what I'm trying to ask.
Simple example
The question is "how do you group rows by a value, then split occurrences by frequency into buckets with count and average for each of those buckets." Here's a hand-done toy example with rounded averages:
Months with values, each number here represents a row.
Jan 12 24 60 150 320 488
Feb 8 16 40 100 220
Mar 4 8 20 310
Overall figures
Month Count Avg Min Max
Jan 6 176 12 488
Feb 5 77 8 220
Mar 4 86 4 310
The same original data, but with more data, including repeated values and a wider range.
Jan 12 12 12 12 24 24 60 60 150 320 488 500
Feb 8 8 8 8 8 16 40 100 220 440 1100
Mar 4 8 8 8 8 20 20 20 20 310
Overall figures
Month Count Avg Min Max
Jan 12 140 12 500
Feb 11 178 8 1100
Mar 10 43 4 310
Mock-up of one of the sets of data split out into 3 buckets
Month Count Avg Min Max Bucket
Jan 4 12 12 12 0
Jan 4 42 24 60 1
Jan 4 365 150 500 2
...and so on for Feb and Mar
I'm just guessing at how the buckets would split in the mock-up above.
That pretty much captures what I'm trying to do. Group by month name (from_to_node in my real case), split the resulting rows into buckets, and then get min, max, avg, and count for each bucket. It's starting to sound like a pivot (?)
Real Table Setup
Here's the structure of the table I'm getting a feed for:
CREATE TABLE IF NOT EXISTS data.edge_event (
id uuid,
inv_id uuid,
facility_id uuid,
from_node citext,
to_node citext,
from_to_node citext,
from_node_dts timestamp without time zone,
to_node_dts timestamp without time zone,
seconds integer,
cycle_id uuid
);
The duration is pre-calculated in seconds, and the area of interest for now is only the from_to_node name. So, it's fair to think of the example as
CREATE TABLE IF NOT EXISTS data.edge_event (
from_to_node citext,
seconds integer
);
Raw Data
Within the edge_event table, there are 159 distinct from_to_node values over around 300K event rows. Some are found in only a handful of edge_event records, some are found in thousands, or tens of thousands. That's too much to provide a good sample for. But to make the problem simpler to follow, a from_to_node might be
Boxing_Assembly 1256
Meaning "it took 1256 seconds to move this part from the Boxing phase to the Assembly phase." And here we might have 10,000 other records for "Boxing_Assembly" with different durations.
Goal
We're looking for two things out of each from_to_node. For something like Boxing_Assembly, I'm trying to do this:
Sort the seconds taken into buckets, say 20 buckets. This is for a histogram.
For each bucket get the
count of edge_event rows
avg(seconds) within the bucket
min/first_value(seconds) within the bucket
max/last_value(seconds) within the bucket
So, we're looking to chart durations to look for clusters, and then get the raw seconds out of any common clusters.
What I've tried
I've tried a lot of different code, and I've not succeeded. It seems like a problem for GROUP BY and/or window functions. There's something I'm not getting, as my results are far from the mark.
I know that I haven't provided sample data, which makes it harder to help. But I'm guessing that what I'm missing is one or more concepts. Pretty much, I want to know how to split out the edge_event data by from_to_node and then by seconds. Given the huge ranges across from_to_nodes, I'm trying to bucket each one individually based on its own min/max.
Thanks very much for any help.
Draft Attempt
I've developed a query that works a bit, but not entirely. This is an edit from my original post with broken code.
WITH min_max AS (
    SELECT from_to_node,
           min(seconds),
           max(seconds)
      FROM edge_event
     GROUP BY from_to_node
)
SELECT edge_event.from_to_node,
       -- width_bucket(x, low, high, 99) yields buckets 1..99 for values in [low, high),
       -- 0 for values below low, and 100 for values at or above high (i.e. the max itself).
       width_bucket(seconds, min_max.min, min_max.max, 99) as bucket,
       count(*) as frequency,
       min(seconds) as seconds_min,
       max(seconds) as seconds_max,
       max(seconds) - min(seconds) as bucket_width,
       round(avg(seconds)) as seconds_avg
  FROM edge_event
  JOIN min_max ON (min_max.from_to_node = edge_event.from_to_node)
 WHERE min_max.min <> min_max.max   -- can't have a bucket whose upper and lower bounds are the same
   AND edge_event.from_to_node IN ('Boxing_Assembly',
                                   'Assembly_Waiting For QA')
 GROUP BY edge_event.from_to_node,
          bucket
 ORDER BY from_to_node,
          bucket
What I'm getting back looks pretty good:
from_to_node bucket frequency seconds_min seconds_max bucket_width seconds_avg
Boxing_Assembly 1 912 17 7052 7035 3037
Boxing_Assembly 2 226 7058 13937 6879 9472
Boxing_Assembly 3 41 14151 21058 6907 16994
Boxing_Assembly 4 16 21149 27657 6508 23487
Boxing_Assembly 5 4 28926 33896 4970 30867
Boxing_Assembly 6 1 37094 37094 0 37094
Boxing_Assembly 7 1 43228 43228 0 43228
Boxing_Assembly 10 2 63666 64431 765 64049
Boxing_Assembly 14 1 94881 94881 0 94881
Boxing_Assembly 16 1 108254 108254 0 108254
Boxing_Assembly 37 1 257226 257226 0 257226
Boxing_Assembly 40 1 275140 275140 0 275140
Boxing_Assembly 68 1 471727 471727 0 471727
Boxing_Assembly 100 1 696732 696732 0 696732
Assembly_Waiting For QA 1 41875 1 18971 18970 726
Assembly_Waiting For QA 9 1 207457 207457 0 207457
Assembly_Waiting For QA 15 1 336711 336711 0 336711
Assembly_Waiting For QA 38 1 906519 906519 0 906519
Assembly_Waiting For QA 100 1 2369669 2369669 0 2369669
One problem here is that the buckets aren't evenly sized; they seem kind of weird. I've also tried specifying 10, 20, or 100 buckets, and I get similar results. I'm hoping that there is a better way to allocate the data to buckets that I'm missing, and that there's a way to have zero-entry buckets instead of gaps.
I would use the PostgreSQL optimizer for that. It collects exactly the information you want.
Create a temporary table with the values you are interested in and ANALYZE it. Then look into pg_stats for the following:
if there are "most common values", you have them and their frequency right there.
Otherwise, look for adjacent histogram boundaries that are close together. Such a bucket is an interval where values are "lumped".
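A minimal sketch of that approach, borrowing the table and column names from the question (the temporary-table name and the statistics target are just illustrative):
-- collect optimizer statistics for one from_to_node and inspect them
CREATE TEMP TABLE boxing_assembly_seconds AS
SELECT seconds
  FROM data.edge_event
 WHERE from_to_node = 'Boxing_Assembly';

-- optionally raise the statistics target to get more histogram buckets
ALTER TABLE boxing_assembly_seconds ALTER COLUMN seconds SET STATISTICS 1000;

ANALYZE boxing_assembly_seconds;

SELECT most_common_vals,    -- values that "clump", with ...
       most_common_freqs,   -- ... their relative frequencies
       histogram_bounds     -- boundaries of roughly equal-population buckets
  FROM pg_stats
 WHERE tablename = 'boxing_assembly_seconds'
   AND attname = 'seconds';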

Alternative to dec2hex in MATLAB?

I am using dec2hex up to 100 times in MATLAB, and it slows the code down. For one point I call dec2hex 100 times, which takes a minute or more, and I have to do the same for 5000 points, so because of dec2hex the whole run takes hours. How can I do the decimal-to-hexadecimal conversion optimally? Is there any other alternative that can be used instead of dec2hex?
As an example:
%% Data(1:256): can be any data
for i = 1:256
    Table = dec2hex(Data);
    %% some permutation applied on Data
end
Here I am using dec2hex more than 100 times for one point. And I have to use it for 5000 points.
Data =
Columns 1 through 16
105 232 98 250 234 216 98 199 172 226 250 215 188 11 52 174
Columns 17 through 32
111 181 71 254 133 171 94 91 194 136 249 168 177 202 109 187
Columns 33 through 48
232 249 191 60 230 67 183 122 164 163 91 24 145 124 200 142
This is the kind of data my code uses.
Function calls are (still) expensive in MATLAB. This is one of the reasons why vectorization and pseudo-vectorization are strongly recommended: processing an entire array of N values in one function call is far better than calling the processing function N times, once per element, saving the overhead of the N-1 extra calls.
So, what can you do? Here are some choices, which are not mutually exclusive:
Profile your code first. Just because something looks like the main culprit for execution-time disasters doesn't mean it necessarily is. Type profview in your command window, choose the script that you want to run, and see where the hotspots of your code are. Optimize those hotspots rather than your initial guesses.
Try faster functions. sprintf is usually fast and flexible:
Table = sprintf('%04X\n', Data);
(and, if you dive into the function code with edit dec2hex, you'll see that in some cases dec2hex actually calls sprintf).
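Note that sprintf returns a single newline-separated character vector rather than a char matrix. If a two-column char array like the one dec2hex produces is needed, here is a minimal sketch along the same lines (assuming the values fit in two hex digits, as in the sample data above):
% sketch: build an N-by-2 char array (like dec2hex) from a single sprintf call
Data     = [105 232 98 250 234 216 98 199];   % sample values from the question
hexChars = sprintf('%02X', Data);             % one long char row vector, 2 chars per value
Table    = reshape(hexChars, 2, []).';        % row k holds the hex digits of Data(k)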
Reduce the number of function calls. Suppose you have to build the table for 100 datasets of different lengths that are stored in a cell array:
DataSet = cell(1,100);
for k = 1:100
    DataSet{k} = fix(1000*rand(k,1));
end
The idea is to assemble all the numbers into a single array that you convert at once:
Table = dec2hex(vertcat(DataSet{:}));
Mind you, this comes at the expense of extra memory for assembling the partial inputs into a single array, so it's not always convenient to do that.
All the variants above. Okay, this point is not actually a point. :-)

irregular time series data interpolation

I'm a newbie in MATLAB. After using a specific application, I get a file which contains acceleration data recorded roughly every 160 ms.
16 25 50 32 234 199 6
16 25 50 192 240 196 3
16 25 50 352 236 199 8
16 25 50 512 238 198 7
16 25 50 671 242 195 11
16 25 50 832 237 198 9
As you can see, the interval varies around 160 ms; it is not fixed.
The first four columns are the time stamp (hour, minute, second, millisecond) and the remaining columns are the acceleration data.
Since the sample rate is not constant, my goal is to get acceleration data every 160 ms.
I was thinking of resampling the acceleration data by interpolation.
First, I convert my timestamps to seconds:
s=data(:,3)+data(:,4)/1000; % convert to seconds+fractions
dt=diff(datenum(2013,1,1,data(:,1),data(:,2),s))*86400;
t= cumsum(diff(datenum(2014,06,09,data(:,1),data(:,2),s))*86400);
sample = interp1(t,data(:,5:end),[0:160:t(end)]);
Is that correct?
Thanks in advance.
I'm not sure if this is what you're doing already with all that diff/cumsum stuff, but I would make t start at 0:
t = datenum(2013,1,1,data(:,1),data(:,2),s)*(24*60*60);
t = t-t(1);
sample = interp1(t,data(:,5:end), 0:0.16:t(end));
The idea here is that we know we want to sample every 0.16 seconds, but only relative to the starting time. So if we reset the starting time to be 0, then we can just use 0:0.16:(end time - start time) as our sampling vector. The easiest way to make the start time 0 is to simply subtract the start time from the whole time vector, hence t = t - t(1). This also has the bonus effect of making t(end) equal the end time minus the start time.

Create intervals with 'datenum'

I am creating the limits for an equally spaced time series and I need to be able to change the time interval (1 min, 5 min, 10 min, 15 min, 30 min, 60 min, etc.). My bounds are the opening and closing times of the market. The stocks I am working on trade from 17:00 to 16:15 of the following day.
Here is what I am using:
timevec=datenum(2013,1,1,17,0:1*interval:1395,0)';
% It creates a time vector from 1-1-2013 17.00.00 to 1-2-2013 16.15.00
% spaced by "1min*interval"
The formula to be used is pretty simple, but a problem arises if I need to use 10 min or 30 min, as the result would be:
(10min)
02-Jan-2013 15:50:00
02-Jan-2013 16:00:00
02-Jan-2013 16:10:00
(30min)
02-Jan-2013 15:30:00
02-Jan-2013 16:00:00
What I would like to have is an extra interval, 16:20:00 for the 10-minute case and 16:30:00 for the 30-minute case. The only solution I can come up with is either moving the bound to 16:30 and adding an if statement to remove the extra observations when they are not needed, or keeping the bound at 16:15:00 and adding an if statement to add the extra observation when it is needed.
Is there any way to handle both cases in one line?
MATLAB creates ranges such that all values are inside the limits. If you want to add one additional value right outside the limits, you can modify the upper limit by adding almost one interval length to the end:
step = 15;
1:step:100+0.99*step
ans =
1 16 31 46 61 76 91 106
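Applied to the datenum expression from the question, this might look like the following (a sketch; interval is the step in minutes, as in the original):
% sketch: pad the upper bound by just under one step, so the first boundary
% past 16:15 (e.g. 16:20 or 16:30) is included when interval does not divide 1395
interval = 10;                                                   % minutes
timevec  = datenum(2013,1,1,17, 0:interval:1395+0.99*interval, 0)';
datestr(timevec(end))                                            % 02-Jan-2013 16:20:00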

Rearrange distribution function Matlab

I have the following data representing values over a 12 month period:
1. 0
2. 253
3. 168
4. 323
5. 556
6. 470
7. 225
8. 445
9. 98
10. 114
11. 381
12. 187
How can I smooth this line forward?
What I need is that, going through the list sequentially, any value that is above the mean (268) is evenly distributed amongst the remaining months, but in such a way that it produces as smooth a line as possible. I need to go through from Jan to Dec in order. Looking forward, I want to sweep any excess (peaks) into the months still to come such that the distribution is as even as possible (such that the troughs are filled first). So the issue is, at each point, to determine what the "excess" for that month is and, secondly, how to distribute that amongst the months still to come.
I have used
p = find(Diff>0);
n = find(Diff<=0);
POS = Diff(p,1);
NEG = Diff(n,1)
to see where shortfalls/excesses against the mean exist, but I am unsure how to construct code that will redistribute forward by allocating to the "troughs" of the distribution first. An analogy is that these numbers represent harvest quantities and I must give out the harvest to the population. How do I redistribute the incoming harvest over the year such that I minimise over-supply/under-supply? I obviously cannot give out anything I haven't received in a particular month unless I have stored some harvest from previous months.
e.g. I start in Jan and see that I cannot give anything to the months Feb to Dec, so the value for Jan is 0. In Feb I have 253: do I adjust 253 downwards or give it all out? If so, by how much? And where do I redistribute the excess I trim between Mar and Dec? And so on and so forth. How do I do this to give as smooth (even) a distribution as possible?
For any month, the new value assigned to that month cannot exceed the current value. The sum for the 12 months must be equal before and after smoothing. As the first position, January will always be 0.
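For illustration only, here is a minimal sketch of the carry-forward mechanics in the harvest analogy (hand out an even share of whatever is currently available each month and store the rest); this is just one reading of the constraints above, not a complete answer:
% sketch: greedy forward redistribution of the monthly values
y     = [0 253 168 323 556 470 225 445 98 114 381 187];
out   = zeros(size(y));      % smoothed values
store = 0;                   % excess carried forward from earlier months
for k = 1:numel(y)
    avail  = store + y(k);                 % nothing is handed out before it arrives
    out(k) = avail / (numel(y) - k + 1);   % even share over the months still to come
    store  = avail - out(k);               % keep the rest for later months
end
% at the end store is 0, so sum(out) equals sum(y); out(1) is 0 because nothing has arrived in Jan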
A simple version just loops through and, if the next month is lower than the current month, passes value forward to equalise them.
for n = 1:11
    if y(n) > y(n+1)
        y(n:n+1) = (y(n)+y(n+1))/2;
    end
end
It's not very clear to me what you're asking... It sounds a bit like a roundabout way of asking how to fit a straight line to data. If that is the case, see below. Otherwise, please clarify a bit more what you want: provide toy example input data and the expected output data.
y = [ 0 253 168 323 556 470 225 445 98 114 381 187 ].';
x = (0:numel(y)-1).';
A = [ones(size(x)) x];
plot(...
    x, y, 'b.',...
    x, A*(A\y), 'r')
xlabel('Month'), ylabel('Data')
legend('original data', 'fit')
I don't get exactly what you want either; maybe something simple like this?
year = [0 253 168 323 556 470 225 445 98 114 381 187];
m = mean(year);
total_before = sum(year)
linear_year = linspace(0, m*2, 12);
total_after = sum(linear_year)
This gives you a line; the sum stays the same (the mean of linspace(0, 2*m, 12) is m, so its sum is also 12*m) and the line is perfectly smooth.