KDB/Q-sql Dynamic Grouping and concatenating columns in output - kdb

I have a table where I have to perform a group-by on dynamic columns and then an aggregation; the result should be rows whose labels concatenate the group-by column values with the name of the aggregated column, alongside the aggregation (both supplied by the user).
For example:
g1 g2 g3 g4 col1 col2
A D F H 10 20
A E G I 11 21
B D G J 12 22
B E F L 13 23
C D F M 14 24
C D G M 15 25
If I need to group by g1,g2,g4 with an avg aggregation on col1, the output should look like this:
filed val
Avg[A-D-H-col1] 10.0
Avg[A-E-I-col1] 11.0
Avg[B-D-J-col1] 12.0
Avg[B-E-L-col1] 13.0
Avg[C-D-M-col1] 14.5
I am able to do this using q-sql if my group-by columns are fixed:
t:([]g1:`A`A`B`B`C`C;g2:`D`E`D`E`D`D;g3:`F`G`G`F`F`G;g4:`H`I`J`L`M`M;col1:10 11 12 13 14 15;col2:20 21 22 23 24 25)
select filed:first ("Avg[",/:(({"-" sv x} each string (g1,'g2,'g4)),\:"-col1]")),val: avg col1 by g1,g2,g4 from t
I want to use a functional query for the same; that is, I want a function which takes a list of group-by columns, the aggregation to perform, the column name and the table name as input, and produces output like the above query. I can perform the group-by easily using dynamic columns, but I am not able to concatenate them into filed. The function signature will be something like this:
fun:{[glist; agg; col; t] ...; ...}
fun[`g1`g2`g4; avg; `col1; t]
Please help me make the above query dynamic.

You may try the following function:
specialGroup: {[glist;agg;col;table]
 / aggregate col into val, grouped by the columns in glist
 res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
 / capitalise the aggregation name, e.g. avg -> Avg
 aggname: string agg;
 aggname: upper[1#aggname], 1_aggname;
 / build filed as "Agg[key1-key2-...-col]" from the key columns of res
 res: ![res;();0b;enlist[`filed]!enlist({(y,"["),/:("-"sv/:string flip x),\:"]"};enlist,glist,enlist[enlist col];aggname)];
 res
 };
specialGroup[`g1`g2`g4;avg;`col1;t]
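Applied to the example table t, this should produce (a sketch worked from the example data; val comes back as a float because avg is used):
q)specialGroup[`g1`g2`g4;avg;`col1;t]
g1 g2 g4| val  filed
--------| ----------------------
A  D  H | 10   "Avg[A-D-H-col1]"
A  E  I | 11   "Avg[A-E-I-col1]"
B  D  J | 12   "Avg[B-D-J-col1]"
B  E  L | 13   "Avg[B-E-L-col1]"
C  D  M | 14.5 "Avg[C-D-M-col1]"
The result is keyed by the grouping columns; take select filed, val from it if you want the flat two-column shape shown in the question.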
specialGroup aggregates values into the val column first and populates the filed column after grouping. This avoids generating duplicate filed values and then having to select the first of each.

If you modify Anton's code to the following, the output will change dynamically with the aggregation passed in:
specialGroup: {[glist;agg;col;table]
 res: ?[table;();{x!x}glist; enlist[`val]!enlist(agg;col)];
 / the capitalisation now happens inside the inner lambda, which receives agg via the projection [;agg]
 res: ![res;();0b;enlist[`filed]!enlist({(@[string[y];0;upper],"["),/:("-"sv/:string flip x),\:"]"}[;agg];enlist,glist,enlist[enlist col])];
 res
 };
Because the part of the code that builds that string sits inside an inner function, you need to pass the agg parameter into the inner function (done here with the projection [;agg]).
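For example, with a different grouping and aggregation the label should follow along (a sketch, assuming string sum yields "sum"):
q)specialGroup[`g1`g2;sum;`col2;t]
g1 g2| val filed
-----| ----------------
A  D | 20  "Sum[A-D-col2]"
A  E | 21  "Sum[A-E-col2]"
B  D | 22  "Sum[B-D-col2]"
B  E | 23  "Sum[B-E-col2]"
C  D | 49  "Sum[C-D-col2]"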

Related

How do I select by month in KDB?

I have a table of the form
t r v
-------------------------------------------------
2016.01.04D09:51:00.000000000 -0.01507338 576
2016.01.04D09:52:00.000000000 -0.001831502 200
2016.01.04D11:37:00.000000000 -0.001100514 583
2016.01.04D12:04:00.000000000 -0.001653045 1000
I want to get the October 2020 values.
I tried doing a query:
select from x where t.month = 2020.10
but this didn't work. I think I might need to cast a date type? What am I doing wrong?
The trailing m lets the interpreter know that the atom is of month type instead of float type.
q)type 2020.10
-9h
q)type 2020.10m
-13h
q)select from x where t.month=2020.10
t
-
q)select from x where t.month=2020.10m
t
-----------------------------
2020.10.20D20:20:00.000000000
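For reference, if you have a date in hand rather than a month literal, casting it with `month$ yields the same month type:
q)`month$2020.10.15
2020.10m
q)type `month$2020.10.15
-13h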

kdb - Nice way of ordering symbols

I have a kdb table containing futures data, with the months to contract expiry stored as symbols like `M0`M1`M2 etc. I want to order these by expiry to get a list like `M1`M2`M3 etc., but when I use asc I get `M1`M11`M12...`M2`M21 etc. I suppose one way to achieve my goal is to strip off the M, cast to integer, sort, then cast back to string, add the M back, and cast to symbol. But that seems long-winded. I was just wondering if there was a better approach?
I think I have replicated a simple version of your problem with:
q)t:([] a:`a`b`c`d`e`f; b:`M1`M4`M2`M21`M12`M11)
q)`num xasc update num: "I"$1_'string b from t
a b num
---------
a M1 1
c M2 2
b M4 4
f M11 11
e M12 12
d M21 21
I just made a new column which extracts the integer part of b, and sorted the table ascending on that column. You can then delete the helper column if you like, using something like
delete num from `num xasc update num: "I"$1_'string b from t
to return your desired table.
Note: this solution assumes that the months-to-expiry column always has the form M(number of months).
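Under the same assumption, you can also skip the helper column entirely and feed the parsed integers straight into iasc (a minimal sketch):
q)t iasc "I"$1_'string t`b
a b
-----
a M1
c M2
b M4
f M11
e M12
d M21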
A more concise method could be to convert b to bytes with -8!, using something like:
q)`num xasc update num:-8!'b from t
a b num
----------------------------------
a M1 0x010000000c000000f54d3100
c M2 0x010000000c000000f54d3200
b M4 0x010000000c000000f54d3400
f M11 0x010000000d000000f54d313100
e M12 0x010000000d000000f54d313200
d M21 0x010000000d000000f54d323100
but this method would be just a bit slower. (It orders correctly here because the serialised bytes encode the length before the characters, so shorter symbols such as `M4 sort ahead of longer ones such as `M11.)

Cumulative function in spark scala

I have tried this to calculate a cumulative value, but if the date field is the same those values all get added into the cumulative field at once. Can someone suggest a solution? Similar to this question:
val windowval = (Window.partitionBy($"userID").orderBy($"lastModified")
.rangeBetween(Window.unboundedPreceding, 0))
val df_w_cumsum = ms1_userlogRewards.withColumn("totalRewards", sum($"noOfJumps").over(windowval)).orderBy($"lastModified".asc)
df_w_cumsum.filter($"batchType".isNull).filter($"userID"==="355163").select($"userID", $"noOfJumps", $"totalRewards",$"lastModified").show()
Note that your very first totalRewards=147 is the sum of the previous value 49 + all the values with timestamp "2019-08-07 18:25:06": 49 + (36 + 0 + 60 + 2) = 147.
The first option would be to aggregate all the values with the same timestamp first, e.g. groupBy($"userId", $"lastModified").agg(sum($"noOfJumps").as("noOfJumps")) (or something like that), and then run your aggregate sum. This will remove duplicate timestamps altogether.
The second option is to use row_number to define an order among rows with the same lastModified field first, and then run your aggregate sum with .orderBy($"lastModified", $"row_number") (or something like that). This should keep all records and give you partial sums along the way: totalRewards = 49 -> 85 -> 85 -> 145 -> 147 (or something similar, depending on the order defined by row_number).
I think you want to sum by userid and timestamp. So you need to partition by userID and lastModified and use a window function to sum, like the following:
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.expressions.Window
// partition by user and timestamp so that jumps sharing a timestamp are summed together
val window = Window.partitionBy("userID", "lastModified")
df.withColumn("cumulativeSum", sum(col("noOfJumps")).over(window))

slow aggregate on multiple columns

This kdb query that aggregates multiple columns takes approximately 31 seconds, compared to 3 seconds in J.
Is there a faster way to do the sum in kdb?
Ultimately this will be running against a partitioned database on the 32-bit version
/test 1 - using symbols
n: 13000000;
cust: n?`8;
prod: n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
/query 1 - using simple by
q)\t select sum(v) by cust, prod from a
31058
/query 2 - grouping manually
\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)
12887
/query 3 - simpler method of accessing
\t {sum each a.v[group x]} (flip (a.cust;a.prod))
11576
/test 2 - using strings, very slow
n: 13000000;
cust: string n?`8;
prod: string n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
q)\t select sum(v) by cust, prod from a
116745
comparison J code
n=:13000000
cust=: _8[\ a. {~ (65+?(8*n)#26)
prod=: _8[\ a. {~ (65+?(8*n)#26)
v=: ?.n#100
agg=: 3 : 0
keys=:i.~ |: i.~ every (cust;prod)
c=.((~.keys) { cust)
p=.((~.keys) { prod)
s=.keys +//. v
c;p;s
)
NB. 3.57 seconds
6!:2 'r=.agg 0'
3.57139
({.#$) every r
13000000 13000000 13000000
Update:
From the kdbplus forums, we can get down to about 2x the speed difference
q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a
6809
Update 2: added another dataset per @user3576050
This dataset has the same overall number of rows, but is distributed 4 instances per group
n: 2500000
g: 4
v: (g*n)?100
cust: (g*n)#(n?`8)
prod: (g*n)#(n?`8)
b:([]cust:cust; prod:prod ;v:v)
q)\ts select sum v by cust, prod from b
9737 838861968
The previous query runs poorly on the new dataset
q)\ts r:(`cust`prod xkey b inds) + select sum v by cust,prod from b til[count b] except inds:(select cust,prod from b) ? d:distinct select cust,prod from b
17181 671090384
If you update this data less frequently than you query it, how about pre-computing a group index? It’s about the same cost to create as a single query, and it allows querying at ~30x the speed.
q)\ts select sum v by cust,prod from b
14014 838861360
q)\ts update g:`g#{(key group x)?x}flip(cust;prod)from`b
14934 1058198384
q)\ts select first cust,first prod,sum v by g from b
473 201327488
q)
The results match up to row order and schema details:
q)(select sum v by cust,prod from b)~`cust`prod xasc 2!delete g from select first cust,first prod,sum v by g from b
1b
q)
(BTW, I know essentially nothing about J, but my guess would be that it’s computing a similar multi-column group index. q’s g index is unfortunately (currently?) limited to plain vector data—if it were possible to somehow apply it to the combination of cust and prod, I expect we’d see results like mine from the simple query.)
You are using a pathological dataset: a set of random symbols of length 8 will have few duplicates, making the grouping redundant.
q)n:13000000; (count distinct n?`8)%n
0.9984848
p#/g# attributes (mentioned in comments above) will have no impact on performance for the same reasons.
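For reference, applying the attribute is a one-liner (a sketch; per the above, it won't help on near-unique keys like these):
q)update `g#cust from `a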
You will see better performance with more appropriate data.
q)n:1000000
q)
q)a:([]cust:n?`8; prod:n?`8; v:n?100)
q)b:([]cust:n?`3; prod:n?`3; v:n?100)
q)
q)\ts select sum v by cust, prod from a
3779 92275568
q)
q)\ts select sum v by cust, prod from b
762 58786352

Perl + PostgreSQL-- Selective Column to Row Transpose

I'm trying to find a way to use Perl to further process a PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to collapse certain columns (Realtime, Value) into one row per group, concatenating their values, while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns)
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
, string_agg(value::text, ',') AS values
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into one delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values. (The cast to text is needed because string_agg() operates on text input.)
About 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience: GROUP BY 1, 2 here is shorthand for grouping by the first two items in the SELECT list, id and cat. Especially handy with complex expressions in the SELECT list.