Sliding window on time in KDB/Q - aggregate

There are some functions in Q/KDB that let us aggregate over a sliding window (msum, mavg, etc.), but these functions take the number of previous rows into account.
I'd like a function that aggregates over a sliding window defined by time rather than by a number of rows, for example over the last 5 minutes.
Do such functions exist? If not, how can I design one? I don't want to use a while loop, as it would slow my program down too much given the huge amount of data.
Thank you for your help

If you want to aggregate into fixed time buckets, by and xbar are your friends: http://code.kx.com/q/ref/arith-integer/#xbar
trade: ([] time:`time$(10:00 10:01 10:03 10:07 10:09); price:`float$(12.1 12.6 12.4 12.4 12.9); size:`int$(5 6 10 34 2))
select last price, sum size by 5 xbar time.minute from trade
minute| price size
------| ----------
10:00 | 12.4 21
10:05 | 12.9 36
If you want to go back 5 minutes in time for every row, a window join is what you are looking for: http://code.kx.com/q/ref/joins/#wj-wj1-window-join
w:-300000 0+\:trade.time                             / window boundaries: 5 minutes (300,000 ms) before each trade time, up to the trade time
wj1[w;`time;trade;(trade;(last;`price);(sum;`size))] / for each trade, last price and total size within its window
time price size
-----------------------
10:00:00.000 12.1 5
10:01:00.000 12.6 11
10:03:00.000 12.4 21
10:07:00.000 12.4 44
10:09:00.000 12.9 36

Related

Is the complexity of kdb's moving max function mmax O(n)?

I used the function mmax to calculate the moving max of a 10-million-length integer vector. I ran it 10 times to calculate the total execution time. The running time for window size 132 (15,025 milliseconds) is 6 times longer than for window size 22 (2,425 milliseconds). It seems the complexity of mmax is O(nw) rather than O(n), where w is the length of the sliding window.
To check if this is true for other similar products, I tried the same experiment on DolphinDB, a time series database with built-in analytics features (https://www.dolphindb.com/downloads.html ). In contrast, DolphinDB’s mmax has linear complexity O(n), regardless of the window size: 1,277 milliseconds (window size 132) and 1,233 milliseconds (window size 22).
The hardware being used for this experiment:
Server: Dell PowerEdge R630
Architecture: x86_64
CPU Model Name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Total logical CPU cores: 48
Total memory: 256G
Experiment setup
I used the 64-bit version of KDB+ 4.0 and DolphinDB_Linux_V2.00.7 (DolphinDB community version: 2 cores and 8 GB memory). Both experiments were conducted using 2 CPU cores.
KDB implementation
// Start the server
rlwrap -r taskset -c 0,1 ./l64/q -p 5002 -s 2
// The code
a:10000000?10000i
\t do[10; 22 mmax a]
2425
\t do[10; 132 mmax a]
15025
DolphinDB implementation
// Start the server
rlwrap -r ./dolphindb -localSite localhost:5002:local5002 -localExecutors 1
// The code
a=rand(10000,10000000)
timer(10) mmax(a,22);
1232.83 ms
timer(10) mmax(a,132);
1276.53 ms
Can any kdb expert confirm the complexity of the function mmax? If the built-in mmax does have O(nw) complexity, is there any third-party plugin for kdb that improves the performance?
Yes, it would scale with the size of the window as well as the size of the list. If you look at the definition of mmax:
q)mmax
k){(x-1)|':/y}
it is "equivalent" to
q)a:8 1 9 5 4 6 6 1 8 5
q)w:3
q)mmax[w;a]~{{max x,y}':[x]}/[w-1;a]
1b
which can more clearly be understood as the last output of:
q){{max x,y}':[x]}\[w-1;a]
8 1 9 5 4 6 6 1 8 5
8 8 9 9 5 6 6 6 8 8
8 8 9 9 9 6 6 6 8 8
....take the max of each item with its previous item, {max x,y}':[x]
....then do that same operation again on the output, {}\
....do the same operation again on the output (w-1) times, \[w-1;a]
From that it's clear that the window size impacts the number of times the loop is performed. As for a faster implementation, there might be a different but less "elegant" algorithm that does it quicker and could be written in k/q. Otherwise you could import an implementation written in C - see https://code.kx.com/q/ref/dynamic-load/
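For reference, here is a minimal sketch of one such algorithm, the per-block prefix/suffix-maxima trick, which does O(n) work regardless of window size. The name fmmax and the implementation are illustrative only, not how the built-in works:
/ any w-item window spans at most two adjacent w-sized blocks, so its max is
/ the max of one suffix-max value from the first block and one prefix-max value from the second
fmmax:{[w;v]
  l:raze maxs each w cut v;                            / left-to-right running max within each block
  r:raze reverse each maxs each reverse each w cut v;  / right-to-left running max within each block
  l|(w-1) xprev r}
q)fmmax[3;a]~3 mmax a
1b
Whether this actually beats the built-in for your window sizes would need measuring, since mmax is implemented in C; otherwise the C-plugin route via the dynamic-load link above is the other option.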

Grouping data into buckets by frequency in Postgres 11.6

Using Postgres 11.6, I'm trying to analyze some event data. The goal is to find the durations for all events with a specific name, and then split each one out into evenly sized buckets. We're looking for any times that "clump" for a specific event. I'm editing my question as the specific case may be obscuring what I'm trying to ask.
Simple example
The question is "how do you group rows by a value, then split occurrences by frequency into buckets with count and average for each of those buckets?" Here's a hand-done toy example with rounded averages:
Months with values, each number here represents a row.
Jan 12 24 60 150 320 488
Feb 8 16 40 100 220
Mar 4 8 20 310
Overall figures
Month Count Avg Min Max
Jan 6 176 12 488
Feb 5 77 8 220
Mar 4 86 4 310
The same original data, but with more data, including repeated values and a wider range.
Jan 12 12 12 12 24 24 60 60 150 320 488 500
Feb 8 8 8 8 8 16 40 100 220 440 1100
Mar 4 8 8 8 8 20 20 20 20 310
Overall figures
Month Count Avg Min Max
Jan 12 140 12 500
Feb 11 178 8 1100
Mar 10 43 4 310
Mock-up of one of the sets of data split out into 3 buckets
Month Count Avg Min Max Bucket
Jan 4 12 12 12 0
Jan 4 42 24 60 1
Jan 4 365 150 500 2
...and so on for Feb and Mar
I'm just guessing at how the buckets would split in the mock-up above.
That pretty much captures what I'm trying to do. Group by month name (from_to_node in my real case), split the resulting rows into buckets, and then get min, max, avg, and count for each bucket. It's starting to sound like a pivot (?)
Real Table Setup
Here's the structure of table I'm getting a feed for:
CREATE TABLE IF NOT EXISTS data.edge_event (
id uuid,
inv_id uuid,
facility_id uuid,
from_node citext,
to_node citext,
from_to_node citext,
from_node_dts timestamp without time zone,
to_node_dts timestamp without time zone,
seconds integer,
cycle_id uuid
);
The duration is pre-calculated in seconds, and the area of interest for now is only the from_to_node name. So, it's fair to think of the example as
CREATE TABLE IF NOT EXISTS data.edge_event (
from_to_node citext,
seconds integer
);
Raw Data
Within the edge_event table, there are 159 distinct from_to_node values over around 300K event rows. Some are found in only a handful of edge_event records, some are found in thousands, or tens of thousands. That's too much to provide a good sample for. But to make the problem simpler to follow, a from_to_node might be
Boxing_Assembly 1256
Meaning "it took 1256 seconds to move this part from the Boxing phase to the Assembly phase." And here we might have 10,000 other records for "Boxing_Assembly" with different durations.
Goal
We're looking for two things out of each from_to_node. For something like Boxing_Assembly, I'm trying to do this:
Sort the seconds taken into buckets, say 20 buckets. This is for a histogram.
For each bucket get the
count of edge_event rows
avg(seconds) within the bucket
min/first_value(seconds) within the bucket
max/last_value(seconds) within the bucket
So, we're looking to chart durations to look for clusters, and then get the raw seconds out of any common clusters.
What I've tried
I've tried a lot of different code, and I've not succeeded. It seems like a problem for GROUP BY and/or window functions. There's something I'm not getting, as my results are far from the mark.
I know that I haven't provided sample data, which makes it harder to help. But I'm guessing that what I'm missing is one or more concepts. Pretty much, I want to know how to split out the edge_event data by from_to_node and then by seconds. Given the huge ranges across from_to_nodes, I'm trying to bucket each individually based on their own min/max.
Thanks very much for any help.
Draft Attempt
I've developed a query that works a bit, but not entirely. This is an edit from my original post with broken code.
WITH
min_max AS
(
SELECT from_to_node,
min(seconds),
max(seconds)
FROM edge_event
GROUP BY from_to_node
)
SELECT edge_event.from_to_node,
width_bucket (seconds, min_max.min, min_max.max, 99) as bucket, -- Buckets are counted from 0, so 9 gets you 10 buckets, if you have enough data.
count(*) as frequency,
min(seconds) as seconds_min,
max(seconds) as seconds_max,
max(seconds) - min(seconds) as bucket_width,
round(avg(seconds)) as seconds_avg
FROM edge_event
JOIN min_max ON (min_max.from_to_node = edge_event.from_to_node)
WHERE min_max.min <> min_max.max AND -- Can't have a bucket with an upper and lower bound that are the same.
edge_event.from_to_node IN (
'Boxing_Assembly',
'Assembly_Waiting For QA')
GROUP BY edge_event.from_to_node,
bucket
ORDER BY from_to_node,
bucket
What I'm getting back looks pretty good:
from_to_node bucket frequency seconds_min seconds_max bucket_width seconds_avg
Boxing_Assembly 1 912 17 7052 7035 3037
Boxing_Assembly 2 226 7058 13937 6879 9472
Boxing_Assembly 3 41 14151 21058 6907 16994
Boxing_Assembly 4 16 21149 27657 6508 23487
Boxing_Assembly 5 4 28926 33896 4970 30867
Boxing_Assembly 6 1 37094 37094 0 37094
Boxing_Assembly 7 1 43228 43228 0 43228
Boxing_Assembly 10 2 63666 64431 765 64049
Boxing_Assembly 14 1 94881 94881 0 94881
Boxing_Assembly 16 1 108254 108254 0 108254
Boxing_Assembly 37 1 257226 257226 0 257226
Boxing_Assembly 40 1 275140 275140 0 275140
Boxing_Assembly 68 1 471727 471727 0 471727
Boxing_Assembly 100 1 696732 696732 0 696732
Assembly_Waiting For QA 1 41875 1 18971 18970 726
Assembly_Waiting For QA 9 1 207457 207457 0 207457
Assembly_Waiting For QA 15 1 336711 336711 0 336711
Assembly_Waiting For QA 38 1 906519 906519 0 906519
Assembly_Waiting For QA 100 1 2369669 2369669 0 2369669
One problem here is that the buckets aren't evenly sized...they seem kind of weird. I've also tried specifying 10, 20, or 100 buckets, and get similar results. I'm hoping that there is a better way to allocate the data to buckets that I'm missing, and that there's a way to have zero-entry buckets instead of gaps.
I would use the PostgreSQL optimizer for that. It collects exactly the information you want.
Create a temporary table with the values you are interested in and ANALYZE it. Then look into pg_stats for the following:
if there are "most common values", you have them and their frequency right there.
Otherwise, look for adjacent histogram boundaries that are close together. Such a bucket is an interval where values are "lumped".
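A minimal sketch of that approach, reusing the table and column names from the question (the temp table name here is made up):
CREATE TEMP TABLE boxing_assembly AS
SELECT seconds
FROM data.edge_event
WHERE from_to_node = 'Boxing_Assembly';

ANALYZE boxing_assembly;

SELECT most_common_vals,   -- the values that "clump", if any
       most_common_freqs,  -- their frequencies, as fractions of all rows
       histogram_bounds    -- equi-frequency bucket boundaries for the remaining values
FROM pg_stats
WHERE tablename = 'boxing_assembly'
  AND attname = 'seconds';
The number of histogram buckets follows the column's statistics target (default_statistics_target is 100 by default); it can be raised with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS before running ANALYZE if finer buckets are wanted.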

Tableau: Calculated Field based on Dimension and Measure

I'm trying to set up a look at YoY based on quarters, thus (Q1 2016 Rev / Q1 2015 Rev) - 1.
My data is in quarters, so I'm trying to set up a calculated field with (Rev at Current Quarter / Rev at (Current Quarter - 4)) - 1
But I'm not sure how to set up that dependency in Tableau.
Thanks for reading
EDIT:
Example of Data
quarter_id | quarter_revenue
10 | 200
11 | 430
12 | 250
13 | 300
14 | 405
15 | 493
16 | 299
So quarter_id 10 corresponds to 2015 Q1, 11 is 2015 Q2, etc. Currently I can load this into Tableau and use Quick Table Calculation: Percent Difference on Quarter_Revenue, which gets me the difference between id 11 and id 10 (2015 Q2 and 2015 Q1).
What I want to do, however, is compare across a full year, i.e. do this calculation with a 4-quarter offset. So to compare 2015 Q1 vs 2016 Q1, I would need to look at id 14 and id 10, and the Percent Difference calculation would be (405/200)-1.
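For what it's worth, a sketch of one way to express this as a calculated field, assuming the fields are named quarter_id and quarter_revenue as in the example above (this mirrors what the Percent Difference quick table calculation does, just with a 4-row offset):
// YoY change: this quarter's revenue vs the revenue 4 quarters earlier
(ZN(SUM([quarter_revenue])) / LOOKUP(ZN(SUM([quarter_revenue])), -4)) - 1
Set the table calculation to compute using quarter_id so that an offset of -4 means four quarters back; the first four quarters return null because there is nothing to compare against.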

remove a lesser duplicate

In KDB, I have the following table:
q)tab:flip `items`sales`prices!(`nut`bolt`cam`cog`bolt`screw;6 8 0 3 0n 0n;10 20 15 20 0n 0n)
q)tab
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
bolt
screw
In this table, there are 2 duplicate items (bolt). However, since the first 'bolt' contains more information, I would like to remove the 'lesser' bolt.
FINAL RESULT:
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
screw
As far as I understand, if I used the 'distinct' function it's not deterministic?
One way to do it is to fill forward by item; the second bolt then inherits the previous values.
q)update fills sales,fills prices by items from tab
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
bolt 8 20
screw
This can also be done in functional form where you can pass the table and by columns:
{![x;();(!). 2#enlist(),y;{x!fills,/:x}cols[x]except y]}[tab;`items]
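If you then want the five-row FINAL RESULT shown above, distinct on the filled table is deterministic, because the two bolt rows are now identical:
q)distinct update fills sales,fills prices by items from tab
which returns the FINAL RESULT table from the question.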
If "more information" means "least nulls" then you could count the number of nulls in each row and only return those rows by item that contain the fewest:
q)select from (update n:sum each null tab from tab) where n=(min;n)fby items
items sales prices n
--------------------
nut 6 10 0
bolt 8 20 0
cam 0 15 0
cog 3 20 0
screw 2
Although I would not recommend this approach, as it requires working with rows rather than columns.
Because those two rows contain different data, they are considered distinct.
It depends on how you define "more information". You would probably need to provide more examples, but some possibilities:
Delete rows with null sales value
q)delete from tab where null sales
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20
Retrieve rows with the max sales*prices value for each item
q)select from tab where (sales*prices) = (max;sales*prices) fby items
items sales prices
------------------
nut 6 10
bolt 8 20
cam 0 15
cog 3 20

Crystal crosstab: need to limit the number of columns

I am using a Crystal crosstab. My rows are lab results and my columns are dates. I am sorting the dates in descending order so that the most current dates are listed first. I know I can use the TopN formula for rows to limit to a certain number of rows, but I need to limit to a certain number of columns, preferably 10. In the example below I would not want to show anything after 10/10/11.
10/1/12 9/3/12 7/16/12 5/8/12 4/22/12 3/17/12 1/9/12 12/3/11 11/15/11 10/10/11 9/23/11 8/18/11 7/7/11 6/8/11
Calcium 8.5 9 9.1 9 8.9 8.9 9 9 9 9 9 9 8.9 9
Vitamin D 45 45 51 49 56 50 51 55 60 66 60 59 60 61
Any guidance would be greatly appreciated.
Thanks
Jill
I think the crosstab can only limit columns if the names are specified, which will not be possible with dates.
There are two possible work arounds I can think of:
1 - Limit via record selection:
Go to Report > Select Expert > Record, select the date field, click formula, then add this formula (to keep only dates on or after 10/10/2011):
{Mytable.DateField} >= Date (2011, 10, 10)
or, for a dynamic cutoff (no older than 1 year):
{Mytable.DateField} >= DateAdd ("yyyy", -1, CurrentDate)
2 - The other option is to create the cross-tab as a standard report; this means the dates will be vertical rather than horizontal.
You can add a group to the report by date, then add values for each type as summaries; let me know if you prefer this and I can explain in more detail.