kdb - Nice way of ordering symbols

I have a kdb table containing futures data, where the months to contract expiry are stored as symbols like `M0`M1`M2... etc. I want to order this by expiry to get a list like `M1`M2`M3 etc., but when I use asc I get `M1`M11`M12...`M2`M21... etc. I suppose one way to achieve my goal is to strip off the M, cast to integer, sort, cast back to string, add the M back, and then cast to symbol. But this seems long-winded. Is there a better approach?

I think I have replicated a simple version of your problem with:
q)t:([] a:`a`b`c`d`e`f; b:`M1`M4`M2`M21`M12`M11)
q)`num xasc update num: "I"$1_'string b from t
a b num
---------
a M1 1
c M2 2
b M4 4
f M11 11
e M12 12
d M21 21
I just made a new column which extracts the integer of b and ascends the table using this column. You would then be able to delete this column as well if you like using something like
delete num from `num xasc update num: "I"$1_'string b from t
to return your desired table.
Note: this solution assumes that the form of the months to expiry column is always M(months)
A more concise method could be converting b to bytes with -8! by using something like:
q)`num xasc update num:-8!'b from t
a b num
----------------------------------
a M1 0x010000000c000000f54d3100
c M2 0x010000000c000000f54d3200
b M4 0x010000000c000000f54d3400
f M11 0x010000000d000000f54d313100
e M12 0x010000000d000000f54d313200
d M21 0x010000000d000000f54d323100
but this method would be just a bit slower.
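A possible variation on the first approach, shown only as a minimal sketch: compute the sort order with iasc instead of adding and then deleting a helper column. It makes the same assumption as above, namely that every value has the form M followed by digits. The names sorted and s are illustrative.

```q
/ same sample table as above
t:([] a:`a`b`c`d`e`f; b:`M1`M4`M2`M21`M12`M11)
/ "J"$1_'string b drops the leading "M" and parses the remainder as a long;
/ iasc returns the row indices that put those numbers in ascending order
sorted:t iasc "J"$1_'string t`b
/ the same key sorts a bare symbol list
s:`M1`M11`M12`M2`M21`M4
s iasc "J"$1_'string s    / `M1`M2`M4`M11`M12`M21
```

Because the table is indexed by the sorted row order, no extra column has to be created or removed afterwards.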

Related

Select from a table with Limit expression works, without - fails

For a table t with a custom field c which is a dictionary, I can use select with a limit expression, but a simple select fails:
q)r1: `n`m`k!111b;
q)r2: `n`m`k!000b;
q)t: ([]a:1 2; b:10 20; c:(r1; r2));
q)t
a b c
----------------
1 10 `n`m`k!111b
2 20 `n`m`k!000b
q)select[2] c[`n] from t
x
-
1
0
q)select c[`n] from t
'type
[0] select c[`n] from t
^
Is it a bug, or am I missing something?
Upd:
Why does select [2] c[`n] from t work here?
Since c is a list, it does not support key indexing, which is why it returned a type error.
You need to index into each element instead of trying to index the column.
q)select c[;`n] from t
x
-
1
0
A list of conforming dictionaries outside of this context is equivalent to a table, so there you can index the way you were trying to:
q)c:(r1;r2)
q)type c
98h
q)c[`n]
10b
I would say that the way complex columns are represented in memory makes this not possible. I suspect that any modification that creates a copy of a subset of the elements will allow column indexing as the copy will be formatted as a table.
One example here is serialising and deserialising the column (not recommended to do this). In the case of select[n] it is selecting a subset of 2 elements
q)type exec c from t
0h
q)type exec -9!-8!c from t
98h
q)exec (-9!-8!c)[`n] from t
10b

Can (aggregate) functions be used to define a column?

Assume a table like this one:
a | b | total
--|---|------
1 | 2 | 3
4 | 7 | 11
…
CREATE TEMPORARY TABLE summedup (
a double precision DEFAULT 0
, b double precision DEFAULT 0
--, total double precision
);
INSERT INTO summedup (a, b) VALUES (1, 2);
INSERT INTO summedup (a, b) VALUES (4, 7);
SELECT a, b, a + b as total FROM summedup;
It's easy to sum up the first two columns on SELECT.
But does Postgres (9.6) also support the ability to define total as the sum of the other two columns? If so:
What is the syntax?
What is this type of operation called (aggregates typically sum up cells over multiple rows, not columns.)
What you are looking for is typically called a "computed column".
Postgres 9.6 does not support that (Postgres 12 - to be released in Q4 2019 - will).
But for such a simple sum, I wouldn't bother storing redundant information.
If you don't want to repeat the expression, create a view.
I think what you want is a view (the example assumes summedup also has an id column).
CREATE VIEW table_with_sum AS
SELECT id, a, b, a + b as total FROM summedup;
then you can query the view for the sum.
SELECT total FROM table_with_sum where id=5;
The view does not store the sum for each row; the total column is computed every time you query the view. If your goal is to make your query more efficient, this will not help.
There is another way: add the column to the table and create triggers for update and insert that update the total column every time a row is modified.

How to pass dictionary into query constraint?

If this is the dictionary of constraint:
dictName:`region`Code;
dictValue:(`NJ`NY;`EEE213);
dict:dictName!dictValue;
I would like to pass the dict to a function and have the query adapt to however many keys it contains. If there is only one key, region, then I would like the query to be
select from table where region in dict`region;
The same goes for Code. But if I pass two keys, I would like the query to recognize that and become:
select from table where region in dict`region,Code in dict`Code;
Is there any way to do this?
I came up with this code:
funcForOne:{[constraint]?[`bce;enlist(in;constraint;(`dict;enlist constraint));0b;()]};
funcForAll[]
{[dict]$[(null dict)~1;select from bce;($[(count key dict)=1;($[`region in (key dict);funcForOne[`region];funcForOne[`Code]]);select from bce where region in dict`region,rxmCode in dict`Code])]};
It works for one and two constraints, but when I call funcForAll[] it gives a type error. How should I change it? I think it comes from null dict~1.
I tried count too, but that doesn't work well either.
Update
So I did this but I have some error
tab:([]code:`B90056`B90057`B90058`B90059;region:`CA`NY`NJ`CA);
dictKey:`region`Code;dictValue:(`NJ`NY;`B90057);
dict:dictKey!dictValue;
?[tab;f dict;0b;()];
and I got a 'NY error. Do you know why? Also, if I pass an empty dictionary it doesn't seem to work.
As I said, functional form would be the better approach, but if your requirement is as limited as you said, you can consider another solution, as below:
Note: Assuming all dictionary keys will be in table columns list.
q) f:{[dict] if[0=count dict;:select from t];
select from t where (#[key dict;t]) in {$[any 0<=type each value x;flip ;enlist ]x}[dict] }
Explanation:
1. Convert dict to a table depending on the type of its values. Flip if any value is a general list, else enlist.
$[any 0<=type each value dict;flip ;enlist ]dict
2. Get the subset of table t which consists only of the dictionary keys as columns.
#[key dict;t]
3. Get the rows where (2) is in (1).
Basically we are using below form of querying and matching:
q)t1:([]id:1 2;s:`a`b);
q)t2:([]id:1 3 ;s:`a`b);
q)select from t1 where ([]id;s) in t2
If you're just using in, you can do something like:
f:{{[x;y](in),'key[y],'(),x}[;x]enlist each value[x]}
So that:
q)d
a| 10 1
b| ,`a
q)f d
in `a 10 1
in `b ,`a
q)t
a b c
------
1 a 10
2 b 20
3 c 30
q)?[t;f d;0b;()]
a b c
------
1 a 10
Note that because of the enlist each the resulting list is enlisted so that singletons work too:
q)d:enlist[`a]!enlist 1
q)d
a| 1
q)?[t;f d;0b;()]
a b c
------
1 a 10
Update to secondary question
This still works with empty dict, i.e. ()!(). I'm passing in the dictionary variable.
In your 2nd question your dictionary is not constructed correctly (also remember q is case sensitive). Also your values need to be enlisted. Look up functional select in the reference pages on the kx site, you'll see that you need to enlist the symbol lists to differentiate them from column name declarations
`region`code!(enlist `NY`NJ;enlist `B90057)
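As a minimal sketch of the enlisting point above, the where clause can also be built so that the enlisting happens inside the helper and the dictionary itself stays plain. The name mkWhere is hypothetical, and the sketch assumes every key is a column of the table and that every constraint is an in test.

```q
tab:([]code:`B90056`B90057`B90058`B90059;region:`CA`NY`NJ`CA)
/ build one (in;column;enlisted values) constraint per dictionary entry;
/ (),y turns an atom into a one-item list, and enlist marks the list as data
/ so that functional select does not read the symbols as column names
mkWhere:{[d]{(in;x;enlist(),y)}'[key d;value d]}
d:`region`code!(`NJ`NY;`B90057)
?[tab;mkWhere d;0b;()]        / region in NJ or NY, and code B90057
?[tab;mkWhere ()!();0b;()]    / empty dictionary: no constraints, so all rows
```

With the sample table, the first query should return only the B90057/NY row. Note that the key is spelled code to match the column name, since q is case sensitive.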

slow aggregate on multiple columns

This kdb query, which aggregates over multiple columns, takes approximately 31 seconds, compared to 3 seconds with J.
Is there a faster way to do the sum in kdb?
Ultimately this will be running against a partitioned database on the 32-bit version.
/test 1 - using symbols
n: 13000000;
cust: n?`8;
prod: n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
/query 1 - using simple by
q)\t select sum(v) by cust, prod from a
31058
/query 2 - grouping manually
\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)
12887
/query 3 - simpler method of accessing
\t {sum each a.v[group x]} (flip (a.cust;a.prod))
11576
/test 2 - using strings, very slow
n: 13000000;
cust: string n?`8;
prod: string n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
q)\t select sum(v) by cust, prod from a
116745
comparison J code
n=:13000000
cust=: _8[\ a. {~ (65+?(8*n)#26)
prod=: _8[\ a. {~ (65+?(8*n)#26)
v=: ?.n#100
agg=: 3 : 0
keys=:i.~ |: i.~ every (cust;prod)
c=.((~.keys) { cust)
p=.((~.keys) { prod)
s=.keys +//. v
c;p;s
)
NB. 3.57 seconds
6!:2 'r=.agg 0'
3.57139
({.#$) every r
13000000 13000000 13000000
Update:
From the kdbplus forums, we can get down to about 2x the speed difference
q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a
6809
Update 2: added another dataset per #user3576050
This dataset has the same overall number of rows, but is distributed 4 instances per group
n: 2500000
g: 4
v: (g*n)?100
cust: (g*n)#(n?`8)
prod: (g*n)#(n?`8)
b:([]cust:cust; prod:prod ;v:v)
q)\ts select sum v by cust, prod from b
9737 838861968
The previous query runs poorly on the new dataset
q)\ts r:(`cust`prod xkey b inds) + select sum v by cust,prod from a til[count b] except inds:(select cust,prod from b) ? d:distinct select cust,prod from b
17181 671090384
If you update this data less frequently than you query it, how about pre-computing a group index? It’s about the same cost to create as a single query, and it allows querying at ~30x the speed.
q)\ts select sum v by cust,prod from b
14014 838861360
q)\ts update g:`g#{(key group x)?x}flip(cust;prod)from`b
14934 1058198384
q)\ts select first cust,first prod,sum v by g from b
473 201327488
q)
The results match up to row order and schema details:
q)(select sum v by cust,prod from b)~`cust`prod xasc 2!delete g from select first cust,first prod,sum v by g from b
1b
q)
(BTW, I know essentially nothing about J, but my guess would be that it’s computing a similar multi-column group index. q’s g index is unfortunately (currently?) limited to plain vector data—if it were possible to somehow apply it to the combination of cust and prod, I expect we’d see results like mine from the simple query.)
You are using a pathological dataset: a set of random symbols of length 8 will have few duplicates, making the grouping redundant.
q)n:13000000; (count distinct n?`8)%n
0.9984848
The p# and g# attributes (mentioned in the comments above) will have no impact on performance, for the same reason.
You will see better performance with more appropriate data.
q)n:1000000
q)
q)a:([]cust:n?`8; prod:n?`8; v:n?100)
q)b:([]cust:n?`3; prod:n?`3; v:n?100)
q)
q)\ts select sum v by cust, prod from a
3779 92275568
q)
q)\ts select sum v by cust, prod from b
762 58786352

KDB: select rows based on value in one column being contained in the list of another column

A very simple, silly question. Consider the following table:
tt:([]Id:`6`7`12 ;sym:`A`B`C;symlist:((`A`B`M);(`X`Y`Z);(`H`F`C)))
Id sym symlist
---------------
6 A `A`B`M
7 B `X`Y`Z
12 C `H`F`C
I would like to select all rows in tt where the element in sym is contained in the list symlist. In this case, it means just the first and third rows. However, the following query gives me a type error.
select from tt where sym in symlist
(`type)
What's the proper way to do this? Thanks
You want to use the ' (each-both) adverb, so that they "pair up" so to speak. Recall that sym is just a list, and symlist is a list of lists. You want to check each element in sym against the respective sub-list in symlist. You do this by telling it to "pair up".
q)tt:([]id:6712; sym:`A`B`C; symlist:(`A`B`M;`X`Y`Z;`H`F`C))
q)select from tt where sym in'symlist
id sym symlist
----------------
6712 A A B M
6712 C H F C
It's not entirely clear to me why your query results in a type error, so I'd be interested in hearing other people's responses.
q)select from tt where sym in symlist
'type
in
`A`B`C
(`A`B`M;`X`Y`Z;`H`F`C)
q)select from tt where {x in y}[sym;symlist]
id sym symlist
--------------
In response to JPC's answer (couldn't format this as a comment)....
Type error possibly caused by applying "where" to a scalar boolean
q)(`a`b`c) in (`a`g`b;`u`i`o;`g`c`t)
0b
q)where (`a`b`c) in (`a`g`b;`u`i`o;`g`c`t)
'type
Also, the reason the {x in y} lambda doesn't cause the error is because the "in" is obscured and is not visible to the parser (parser doesn't look inside lambdas)
q)0N!parse"select from tt where {x in y}[sym;symlist]";
(?;`tt;,,({x in y};`sym;`symlist);0b;())
Whereas the parser can "see" the "in" in the first case
q)0N!parse"select from tt where sym in symlist";
(?;`tt;,,(in;`sym;`symlist);0b;())
I'm guessing the parser tries to do some optimisations when it sees the "in"
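To make the difference concrete outside of a select, here is a small sketch using the data from the question (only the variable names sym and symlist are carried over; nothing else is assumed):

```q
sym:`A`B`C
symlist:(`A`B`M;`X`Y`Z;`H`F`C)
/ each-both pairs sym[i] with symlist[i] and yields one boolean per row
sym in' symlist          / 101b
/ a boolean vector is what where expects; it returns the indices of the kept rows
where sym in' symlist    / 0 2
```

Plain in against a list of lists instead asks whether the left argument as a whole is one of the items, which is why the example above returns the single atom 0b rather than a vector.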