slow aggregate on multiple columns - kdb

This kdb query, which aggregates over multiple columns, takes approximately 31 seconds, compared with roughly 3 seconds for the equivalent in J.
Is there a faster way to do the sum in kdb?
Ultimately this will run against a partitioned database on the 32-bit version.
/test 1 - using symbols
n: 13000000;
cust: n?`8;
prod: n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
/query 1 - using simple by
q)\t select sum(v) by cust, prod from a
31058
/query 2 - grouping manually
\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)
12887
/query 3 - simpler method of accessing
\t {sum each a.v[group x]} (flip (a.cust;a.prod))
11576
/test 2 - using strings, very slow
n: 13000000;
cust: string n?`8;
prod: string n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
q)\t select sum(v) by cust, prod from a
116745
comparison J code
n=:13000000
cust=: _8[\ a. {~ (65+?(8*n)#26)
prod=: _8[\ a. {~ (65+?(8*n)#26)
v=: ?.n#100
agg=: 3 : 0
keys=:i.~ |: i.~ every (cust;prod)
c=.((~.keys) { cust)
p=.((~.keys) { prod)
s=.keys +//. v
c;p;s
)
NB. 3.57 seconds
6!:2 'r=.agg 0'
3.57139
({.#$) every r
13000000 13000000 13000000
Update:
Using a suggestion from the kdbplus forums, we can get down to about 2x the J time:
q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a
6809
Update 2: added another dataset per @user3576050
This dataset has the same overall number of rows, but is distributed 4 instances per group
n: 2500000
g: 4
v: (g*n)?100
cust: (g*n)#(n?`8)
prod: (g*n)#(n?`8)
b:([]cust:cust; prod:prod ;v:v)
q)\ts select sum v by cust, prod from b
9737 838861968
The previous query runs poorly on the new dataset
q)\ts r:(`cust`prod xkey b inds) + select sum v by cust,prod from b til[count b] except inds:(select cust,prod from b) ? d:distinct select cust,prod from b
17181 671090384

If you update this data less frequently than you query it, how about pre-computing a group index? It’s about the same cost to create as a single query, and it allows querying at ~30x the speed.
q)\ts select sum v by cust,prod from b
14014 838861360
q)\ts update g:`g#{(key group x)?x}flip(cust;prod)from`b
14934 1058198384
q)\ts select first cust,first prod,sum v by g from b
473 201327488
q)
The results match up to row order and schema details:
q)(select sum v by cust,prod from b)~`cust`prod xasc 2!delete g from select first cust,first prod,sum v by g from b
1b
q)
(BTW, I know essentially nothing about J, but my guess would be that it’s computing a similar multi-column group index. q’s g index is unfortunately (currently?) limited to plain vector data—if it were possible to somehow apply it to the combination of cust and prod, I expect we’d see results like mine from the simple query.)
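To make the idea of a precomputed group index concrete, here is a minimal sketch in Python rather than q. The function names and the tiny sample data are hypothetical, and the sketch only illustrates the shape of the technique (build the combined key once, then aggregate over small integer ids); it says nothing about how q implements its g index internally.

```python
from collections import defaultdict

def build_group_index(cust, prod):
    """Map each distinct (cust, prod) pair to an integer id, in first-seen order."""
    ids = {}
    # setdefault returns the existing id, or stores and returns the next free one
    return [ids.setdefault(pair, len(ids)) for pair in zip(cust, prod)]

def sum_by_group(g, v):
    """Sum v per precomputed group id; no pair hashing is needed at query time."""
    totals = defaultdict(int)
    for gid, val in zip(g, v):
        totals[gid] += val
    return dict(totals)

# Hypothetical sample data, standing in for the cust, prod and v columns.
cust = ["a", "b", "a", "b", "a"]
prod = ["x", "y", "x", "x", "x"]
v = [1, 2, 3, 4, 5]

g = build_group_index(cust, prod)   # paid once, reused by every later query
totals = sum_by_group(g, v)
```

The cost of pairing up the two key columns is paid once when the index is built, so each subsequent aggregate only has to walk one integer column.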

You are using a pathological dataset: a set of random symbols of length 8 will have very few duplicates, which makes the grouping redundant.
q)n:13000000; (count distinct n?`8)%n
0.9984848
The `p#/`g# attributes (mentioned in the comments above) will have no impact on performance, for the same reason.
You will see better performance with more appropriate data.
q)n:1000000
q)
q)a:([]cust:n?`8; prod:n?`8; v:n?100)
q)b:([]cust:n?`3; prod:n?`3; v:n?100)
q)
q)\ts select sum v by cust, prod from a
3779 92275568
q)
q)\ts select sum v by cust, prod from b
762 58786352
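The distinct ratio measured earlier in this answer can be roughly sanity-checked with a short calculation. The sketch below is in Python and rests on an assumption that is not stated in the thread: that q draws random symbols from an alphabet of 16 letters, so that there are 16^8 possible symbols of length 8. The function name is hypothetical.

```python
import math

def expected_distinct_fraction(n, alphabet=16, length=8):
    """Expected share of distinct values among n uniform draws.

    ASSUMPTION: symbols are drawn uniformly from alphabet**length possibilities.
    """
    space = alphabet ** length
    # Expected distinct count is approximately space * (1 - exp(-n / space)).
    expected_distinct = space * -math.expm1(-n / space)
    return expected_distinct / n

long_keys = expected_distinct_fraction(13_000_000, length=8)   # close to 1
short_keys = expected_distinct_fraction(1_000_000, length=3)   # close to 0
```

Under that assumption, length-8 keys come out at about 0.9985 distinct, in line with the measured 0.9984848, while length-3 keys repeat heavily, which is the case where grouping actually has work to do.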


What is the meaning of `s attribute on a table?

In the Abridged Q Language Manual Arthur mentioned:
`s#table marks the table to use binary search and marks first column sorted
And if we look into 3.6 version:
N:1000000;
t1:t2:([]n:til N; m:N?`6);
t1:update `p#n from t1;
t2:`s#t2;
(meta t1)[`n]`a / `p
(meta t2)[`n]`a / `p
attr t1 / `
attr t2 / `s
\ts:10000 select count i from t1 where n in 1000?N
/ ~7000
\ts:10000 select count i from t2 where n in 1000?N
/ ~7000
we find that yes, t2 has this attribute: s.
However, the attribute on the first column is reported as p rather than s, the search times are the same, and both tables have the same size (I checked with the objsize function described in the AquaQ blog post).
So, in q 3.6+, is there any difference between `s#table and a table with the `p# attribute on its first column?
I think the only way that `s# on the table itself would improve search times is if you were doing lookups using ?, as described here: https://code.kx.com/q/ref/find/#searching-tables
q)\ts:100000 t1?t1[0]
105 800
q)\ts:100000 t2?t2[0]
86 800
q)
q)\ts:100000 t1?t1[500000]
108 800
q)\ts:100000 t2?t2[500000]
83 800
q)
q)\ts:100000 t1?t1[999999]
107 800
q)\ts:100000 t2?t2[999999]
83 800
It behaves differently for a keyed table (turns it into a step function) but I think that's beyond the scope of your original question.
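As a language-neutral illustration of why a "sorted" marker can matter for find-style lookups, here is a minimal Python sketch contrasting a linear scan with a binary search over a sorted column. The function names are hypothetical and this is not a model of q's internals; it only shows the two search strategies that the sorted attribute lets an engine choose between.

```python
from bisect import bisect_left

def find_linear(column, value):
    """Scan from the start; returns len(column) when the value is absent."""
    for i, x in enumerate(column):
        if x == value:
            return i
    return len(column)

def find_sorted(column, value):
    """Binary search; only valid when column is known to be sorted ascending."""
    i = bisect_left(column, value)
    return i if i < len(column) and column[i] == value else len(column)

n = list(range(1_000_000))          # stands in for the sorted first column
hit_linear = find_linear(n, 999_999)
hit_sorted = find_sorted(n, 999_999)
miss_sorted = find_sorted(n, -5)
```

Both functions return the same index; the binary search simply needs about 20 comparisons for a million rows where the scan may need a million.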

KDB/Q-sql Dynamic Grouping and concatenating columns in output

I have a table on which I need to group by a dynamic set of columns and perform an aggregation. The result should have one column that concatenates the group-by values with the name of the aggregated column supplied by the user, and one column with the aggregated value.
For example :
g1 g2 g3 g4 col1 col2
A D F H 10 20
A E G I 11 21
B D G J 12 22
B E F L 13 23
C D F M 14 24
C D G M 15 25
and if I need to perform group by g1,g2,g4 and avg aggregation on col1 output should be like this
filed val
Avg[A-D-H-col1] 10.0
Avg[A-E-I-col1] 11.0
Avg[B-D-J-col1] 12.0
Avg[B-E-L-col1] 13.0
Avg[C-D-M-col1] 14.5
I am able to perform this if my group by columns are fixed using q-sql
t:([]g1:`A`A`B`B`C`C;g2:`D`E`D`E`D`D;g3:`F`G`G`F`F`G;g4:`H`I`J`L`M`M;col1:10 11 12 13 14 15;col2:20 21 22 23 24 25)
select filed:first ("Avg[",/:(({"-" sv x} each string (g1,'g2,'g4)),\:"-col1]")),val: avg col1 by g1,g2,g4 from t
I want to use a functional query for the same thing, i.e. a function that takes the list of group-by columns, the aggregation to perform, the column name and the table name as input, and produces output like the query above. I can easily group by dynamic columns, but I am not able to build the concatenated field. The function signature would be something like this:
fun:{[glist;agg;col;t] .. ;... }[`g1`g2`g4;avg;`col1;t]
Please help me make the above query dynamic.
You may try the following function:
specialGroup: {[glist;agg;col;table]
res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
aggname: string agg;
aggname: upper[1#aggname], 1_aggname;
res: ![res;();0b;enlist[`filed]!enlist({(y,"["),/:("-"sv/:string flip x),\:"]"};enlist,glist,enlist[enlist col];aggname)];
res
};
specialGroup[`g1`g2`g4;avg;`col1;t]
specialGroup first aggregates the values into the val column, and populates the filed column only after grouping. This avoids generating duplicate filed values and then selecting the first of them.
If you modify Anton's code as follows, the aggregation name in the output will change dynamically:
specialGroup: {[glist;agg;col;table]
res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
res: ![res;();0b;enlist[`filed]!enlist({(@[string[y];0;upper],"["),/:("-"sv/:string flip x),\:"]"}[;agg];enlist,glist,enlist[enlist col])];
res
};
Because the code that builds that string sits inside another function, you need to pass the agg parameter through to the inner function.
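For readers who do not use q, the same "aggregate first, then build the label" approach can be sketched in Python. This is only an illustration of the logic: special_group and avg are hypothetical names, the rows are represented as dictionaries, and the label is derived from the aggregation function's name, mirroring what the q answers do with string agg.

```python
def avg(values):
    """Plain arithmetic mean, named so the label comes out as 'Avg'."""
    return sum(values) / len(values)

def special_group(rows, glist, agg, col):
    """Group rows by the columns in glist, aggregate col, then label each group."""
    groups = {}
    for row in rows:
        key = tuple(row[g] for g in glist)
        groups.setdefault(key, []).append(row[col])
    label = agg.__name__.capitalize()
    # Labels are built once per group, after aggregation, so none are duplicated.
    return [
        (label + "[" + "-".join(key) + "-" + col + "]", agg(vals))
        for key, vals in sorted(groups.items())
    ]

names = ["g1", "g2", "g3", "g4", "col1", "col2"]
data = [
    ("A", "D", "F", "H", 10, 20), ("A", "E", "G", "I", 11, 21),
    ("B", "D", "G", "J", 12, 22), ("B", "E", "F", "L", 13, 23),
    ("C", "D", "F", "M", 14, 24), ("C", "D", "G", "M", 15, 25),
]
t = [dict(zip(names, r)) for r in data]

result = special_group(t, ["g1", "g2", "g4"], avg, "col1")
```

With the sample table from the question, result holds five labelled rows, the last being the C-D-M group with both of its rows averaged.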

Can PostgreSQL LAG() function refer to itself?

I've just discovered the LAG() function in PostgreSQL and I've been experimenting to see what it can achieve. I thought that I might calculate a factorial with it, so I wrote
SELECT i, i * lag(factorial, 1, 1) OVER (ORDER BY i, 1) as factorial FROM generate_series(1, 10) as i;
But the online IDE complains with error 42703: column "factorial" does not exist.
Is there any way I can access the result of previous LAG call?
You can't refer to the column recursively in its definition.
However, you can express the factorial calculation as:
SELECT i, EXP(SUM(LN(i)) OVER w)::int factorial
FROM generate_series(1, 10) i
WINDOW w AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- outputs:
 i  | factorial
----+-----------
  1 |         1
  2 |         2
  3 |         6
  4 |        24
  5 |       120
  6 |       720
  7 |      5040
  8 |     40320
  9 |    362880
 10 |   3628800
(10 rows)
PostgreSQL does support an advanced SQL feature called recursive queries, which can also be used to express the factorial table recursively:
WITH RECURSIVE series AS (
SELECT i FROM generate_series(1, 10) i
)
, rec AS (
SELECT i, 1 factorial FROM series WHERE i = 1
UNION ALL
SELECT series.i, series.i * rec.factorial
FROM series
JOIN rec ON series.i = rec.i + 1
)
SELECT *
FROM rec;
what EXP(SUM(LN(i)) OVER w) does:
This exploits the mathematical identities that:
[1]: log(a * b * c) = log (a) + log (b) + log (c)
[2]: exp (log a) = a
[combining 1&2]: exp(log a + log b + log c) = a * b * c
SQL does not have an aggregate multiply operation, so to perform one we first take the log of each value, then use the sum aggregate function to get the log of the values' product. The final exponentiation step inverts this.
This works as long as the values being multiplied are positive, as log is undefined for 0 and negative numbers. If you have zeros or negative numbers, the trick is to handle them separately: if any value is 0, the whole product is 0; otherwise take the logs of the absolute values, and the result is positive if the number of negative values is even and negative if it is odd. Alternatively, you could move to the complex plane and use the identity Log(z) = ln|z| + iπ for negative real z.
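The zero and sign handling described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not PostgreSQL code, and product_via_logs is a hypothetical name.

```python
import math

def product_via_logs(values):
    """Multiply values using exp(sum(log|v|)), tracking zeros and signs separately."""
    values = list(values)
    if any(v == 0 for v in values):
        return 0.0                      # a single zero makes the whole product zero
    negatives = sum(1 for v in values if v < 0)
    magnitude = math.exp(sum(math.log(abs(v)) for v in values))
    # An odd number of negative factors makes the product negative.
    return -magnitude if negatives % 2 else magnitude

mixed = product_via_logs([2, -3, 4])
with_zero = product_via_logs([1, 0, -5])
ten_factorial = product_via_logs(range(1, 11))
```

Because the result passes through floating-point logs, it is only approximately equal to the exact product, which is why the SQL version casts back to an integer at the end.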
what ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW does
This declares an expanding window frame that includes all preceding rows, and the current row.
e.g.
when i equals 1 the values in this window frame are {1}
when i equals 2 the values in this window frame are {1,2}
when i equals 3 the values in this window frame are {1,2,3}
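The expanding frame can be mimicked in Python to show what each row "sees". This is a sketch of the semantics only; the variable names are hypothetical, and a real window function would not materialise every frame like this.

```python
import math
from itertools import accumulate
from operator import mul

series = list(range(1, 11))

# Frame for row k: every preceding row plus the current row.
frames = [series[: k + 1] for k in range(len(series))]

# Aggregating each frame separately gives the running factorial...
per_frame = [math.prod(frame) for frame in frames]

# ...and a running accumulation gives the same values in a single pass.
running = list(accumulate(series, mul))
```

Both lists contain the factorials of 1 through 10; the second form is closer to how a running aggregate can be evaluated without recomputing each frame from scratch.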
what is a recursive query
A recursive query lets you express recursive logic using SQL. Recursive queries are often used to walk parent-child relationships in relational data (think manager-report, or a product classification hierarchy), but they can generally be used to query any tree-like structure.
Here is a SO answer I wrote a while back that illustrates and explains some of the capabilities of recursive queries.
There are also plenty of useful tutorials on recursive queries. It is a very powerful SQL language feature, and it solves a class of problems that are very difficult to solve without recursion.
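As a small illustration of the manager-report case, here is a sketch that runs a recursive query from Python. Note the assumptions: it uses SQLite through Python's sqlite3 module rather than PostgreSQL, and the staff table, its columns and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT PRIMARY KEY, manager TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("ann", None), ("bob", "ann"), ("cat", "ann"), ("dan", "bob")],
)

rows = conn.execute(
    """
    WITH RECURSIVE chain(name, depth) AS (
        -- anchor: people with no manager sit at the top
        SELECT name, 0 FROM staff WHERE manager IS NULL
        UNION ALL
        -- recursive step: attach each report one level below its manager
        SELECT s.name, c.depth + 1
        FROM staff s
        JOIN chain c ON s.manager = c.name
    )
    SELECT name, depth FROM chain ORDER BY depth, name
    """
).fetchall()
```

The anchor row seeds the result and the recursive step keeps joining until no new rows appear, which is the same shape as the factorial query above.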
Hope this gives you more insight into what the code does. Happy learning!

kdb - Nice way of ordering symbols

I have a kdb table containing future data, in which the months to contract expiry are stored as symbols like `M0`M1`M2... etc. I want to order this by expiry to get a list like `M1`M2`M3 etc., but when I use asc I get `M1`M11`M12...`M2`M21... etc. I suppose one way to achieve my goal is to strip off the M, cast to integer, sort, then cast back to string, add the M back, and cast to symbol. But this seems long-winded. Is there a better approach?
I think I have replicated a simple version of your problem with:
q)t:([] a:`a`b`c`d`e`f; b:`M1`M4`M2`M21`M12`M11)
q)`num xasc update num: "I"$1_'string b from t
a b   num
---------
a M1  1
c M2  2
b M4  4
f M11 11
e M12 12
d M21 21
I just made a new column which extracts the integer of b and ascends the table using this column. You would then be able to delete this column as well if you like using something like
delete num from `num xasc update num: "I"$1_'string b from t
to return your desired table.
Note: this solution assumes that the form of the months to expiry column is always M(months)
A more concise method could be converting b to bytes with -8! by using something like:
q)`num xasc update num:-8!'b from t
a b   num
----------------------------------
a M1  0x010000000c000000f54d3100
c M2  0x010000000c000000f54d3200
b M4  0x010000000c000000f54d3400
f M11 0x010000000d000000f54d313100
e M12 0x010000000d000000f54d313200
d M21 0x010000000d000000f54d323100
but this method would be just a bit slower.
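The first approach in this answer (sort on the number after the M rather than on the text) can be sketched in Python as a sort key. The function name expiry_key is hypothetical, and the sketch makes the same assumption as the answer, namely that every value has the form M followed by an integer.

```python
def expiry_key(sym):
    """Numeric part of a symbol such as 'M12'; assumes a one-letter prefix."""
    return int(sym[1:])

b = ["M1", "M4", "M2", "M21", "M12", "M11"]

by_text = sorted(b)                     # lexicographic order
by_expiry = sorted(b, key=expiry_key)   # order by months to expiry
```

Sorting on a derived key leaves the original values untouched, so there is no need to convert back to a symbol afterwards.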

Can (aggregate) functions be used to define a column?

Assume a table like this one:
a | b | total
--|---|------
1 | 2 | 3
4 | 7 | 11
…
CREATE TEMPORARY TABLE summedup (
a double precision DEFAULT 0
, b double precision DEFAULT 0
--, total double precision
);
INSERT INTO summedup (a, b) VALUES (1, 2);
INSERT INTO summedup (a, b) VALUES (4, 7);
SELECT a, b, a + b as total FROM summedup;
It's easy to sum up the first two columns on SELECT.
But does Postgres (9.6) also support the ability to define total as the sum of the other two columns? If so:
What is the syntax?
What is this type of operation called? (Aggregates typically sum up cells over multiple rows, not columns.)
What you are looking for is typically called a "computed column".
Postgres 9.6 does not support that (Postgres 12, released in Q4 2019, does, under the name "generated columns").
But for such a simple sum, I wouldn't bother storing redundant information.
If you don't want to repeat the expression, create a view.
I think what you want is a view (the example below assumes the table also has an id column).
CREATE VIEW table_with_sum AS
SELECT id, a, b, a + b as total FROM summedup;
then you can query the view for the sum.
SELECT total FROM table_with_sum where id=5;
The view does not store the sum for each row; the total column is computed every time you query the view. If your goal is to make your query more efficient, this will not help.
There is another way: add the column to the table and create triggers for insert and update that set the total column every time a row is modified.
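A minimal sketch of the trigger approach follows. It is written against SQLite through Python's sqlite3 module, not PostgreSQL, so the trigger syntax differs from what Postgres would need (there you would write a trigger function in PL/pgSQL). The id column and the trigger names are hypothetical additions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE summedup (
        id    INTEGER PRIMARY KEY,  -- hypothetical key, not in the original table
        a     REAL DEFAULT 0,
        b     REAL DEFAULT 0,
        total REAL
    );
    -- keep total in step with a and b on insert
    CREATE TRIGGER summedup_ins AFTER INSERT ON summedup
    BEGIN
        UPDATE summedup SET total = NEW.a + NEW.b WHERE id = NEW.id;
    END;
    -- and again whenever a or b changes
    CREATE TRIGGER summedup_upd AFTER UPDATE OF a, b ON summedup
    BEGIN
        UPDATE summedup SET total = NEW.a + NEW.b WHERE id = NEW.id;
    END;
    """
)
conn.execute("INSERT INTO summedup (a, b) VALUES (1, 2)")
conn.execute("INSERT INTO summedup (a, b) VALUES (4, 7)")
conn.execute("UPDATE summedup SET a = 10 WHERE id = 1")

totals = [row[0] for row in conn.execute("SELECT total FROM summedup ORDER BY id")]
```

Unlike the view, this stores the sum, so reads are cheap, at the cost of extra work on every write and a redundant column that must never be updated by hand.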