kdb union join (with plus join) - kdb

I have been stuck on this for a while now but cannot come up with a solution; any help would be appreciated.
I have two tables like:
q)x
a b c d
--------
1 x 10 1
2 y 20 1
3 z 30 1
q)y
a b| c d
---| ----
1 x| 1 10
3 h| 2 20
I would like to sum the common columns and append the new rows. The expected result is:
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
pj only seems to update the (1,x) row and doesn't insert the new (3,h) row. I assume there has to be a way to do some sort of union + plus join in kdb.

You can take advantage of the plus (+) operator here by simply keying x and adding the table y to get the desired table:
q)(2!x)+y
a b| c d
---| -----
1 x| 11 11
2 y| 20 1
3 z| 30 1
3 h| 2 20
The same "plus if there's a matching key, insert if not" behaviour works for dictionaries too:
q)(`a`b!1 2)+`a`c!10 30
a| 11
b| 2
c| 30
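As a rough cross-language sketch of that behaviour (Python is used purely for illustration; the helper name plus_union and the dict-of-dicts layout are assumptions, not anything kdb+ provides), adding keyed tables acts like a key-wise merge in which matching keys are summed and unmatched keys are carried over:

```python
def plus_union(left, right):
    """Add two keyed 'tables' stored as {key: {column: value}} dicts.

    Assumes both sides share the same value columns, as x and y do above.
    """
    # Start from a copy of the left table so the inputs are not mutated.
    result = {key: dict(row) for key, row in left.items()}
    for key, row in right.items():
        if key in result:
            # Matching key: add column by column.
            for col, val in row.items():
                result[key][col] += val
        else:
            # New key: append the row as-is.
            result[key] = dict(row)
    return result


x = {(1, "x"): {"c": 10, "d": 1},
     (2, "y"): {"c": 20, "d": 1},
     (3, "z"): {"c": 30, "d": 1}}
y = {(1, "x"): {"c": 1, "d": 10},
     (3, "h"): {"c": 2, "d": 20}}

combined = plus_union(x, y)
```

Here combined holds the same four rows as the (2!x)+y result, in the same order.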

got it :)
q) (x pj y), 0!select from y where not ([]a;b) in key 2!x
a b c d
--------
1 x 11 11
2 y 20 1
3 z 30 1
3 h 2 20
Always open for a better implementation :D I am sure there is one.


Why is my conforming dictionary not getting turned into a table?

Let's say I have a table:
m:([] t: raze 3#'(2021.01.04+til 5); sym:15#`A`B`C; c: til 15)
t sym c
-----------------
2021.01.04 A 0
2021.01.04 B 1
2021.01.04 C 2
2021.01.05 A 3
2021.01.05 B 4
When I try to pivot it:
exec t!c by sym:sym from m
sym|
---| -----------------------------------------------------------------
A | 2021.01.04 2021.01.05 2021.01.06 2021.01.07 2021.01.08!0 3 6 9 12
B | 2021.01.04 2021.01.05 2021.01.06 2021.01.07 2021.01.08!1 4 7 10 13
C | 2021.01.04 2021.01.05 2021.01.06 2021.01.07 2021.01.08!2 5 8 11 14
I'd expect to get a table back, with columns sym, but I don't. What am I doing wrong?
If you're after a pivot with one column per sym, you would want the following:
q)exec sym!c by t:t from m
t | A B C
----------| --------
2021.01.04| 0 1 2
2021.01.05| 3 4 5
2021.01.06| 6 7 8
2021.01.07| 9 10 11
2021.01.08| 12 13 14
It's because your column names have to be symbols:
q)exec(`$string t)!c by sym:sym from m
sym| 2021.01.04 2021.01.05 2021.01.06 2021.01.07 2021.01.08
---| ------------------------------------------------------
A | 0 3 6 9 12
B | 1 4 7 10 13
C | 2 5 8 11 14
These would be terrible column names though, so I would use .Q.id
q).Q.id exec(`$string t)!c by sym:sym from m
sym| a20210104 a20210105 a20210106 a20210107 a20210108
---| -------------------------------------------------
A | 0 3 6 9 12
B | 1 4 7 10 13
C | 2 5 8 11 14
It sounds like this isn't what you actually want though, so maybe Matthew's answer is more relevant. My answer just explains why the result didn't look like what you expected.
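For readers more at home in Python, the reshaping itself can be sketched as grouping rows into one dictionary per group. This is only an illustrative model: the function name pivot is hypothetical, and the symbol-key requirement discussed above is specific to q and has no counterpart here.

```python
# Rebuild the long table m as (t, sym, c) tuples: 5 dates x 3 syms, c = 0..14.
dates = ["2021.01.04", "2021.01.05", "2021.01.06", "2021.01.07", "2021.01.08"]
rows = [(dates[i // 3], "ABC"[i % 3], i) for i in range(15)]


def pivot(rows):
    """Group by t and build a sym -> c dictionary for each group."""
    wide = {}
    for t, sym, c in rows:
        wide.setdefault(t, {})[sym] = c
    return wide


by_date = pivot(rows)
```

Each group yields a dictionary with the same keys (A, B, C), which is what lets the grouped result line up as a table with one column per key.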

summarise (avg) table (keyed) for each row

Given a keyed table, e.g.:
q)\S 7 / seed random numbers for reproducibility
q)v:flip (neg[d 0]?`1)!#[;prd[d]?12] d:4 6 / 4 cols 6 rows
q)show kt:([]letter:d[1]#.Q.an)!v
letter| c g b e
------| ----------
a | 11 0 3 9
b | 11 8 10 0
c | 7 2 2 3
d | 8 4 9 6
e | 0 0 5 0
f | 1 0 0 11
How can I calculate an average for each row, e.g. (c+g+b+e)%4, for any number of columns?
Following on from your own solution, note that you have to be a little careful with null handling. Your approach won't ignore nulls in the way that avg normally would.
q).[`kt;("a";`g);:;0N];
q)update av:avg flip value kt from kt
letter| c g b e av
------| ---------------
a | 11 3 9
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
To make it ignore nulls you have to avg each row rather than averaging the flip.
q)update av:avg each value kt from kt
letter| c g b e av
------| -------------------
a | 11 3 9 7.666667
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
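The difference between the two approaches can be sketched outside q as follows. This is an illustrative Python model only: None stands in for the q null, and both function names are hypothetical.

```python
def row_avg_propagating_nulls(row):
    """Sum the row and divide by its length; any null makes the result null."""
    if any(v is None for v in row):
        return None
    return sum(row) / len(row)


def row_avg_ignoring_nulls(row):
    """Average only the non-null values, as a per-row avg would."""
    present = [v for v in row if v is not None]
    return sum(present) / len(present) if present else None


# The keyed table kt after column g of row "a" was set to null.
kt = {
    "a": [11, None, 3, 9],
    "b": [11, 8, 10, 0],
    "c": [7, 2, 2, 3],
    "d": [8, 4, 9, 6],
    "e": [0, 0, 5, 0],
    "f": [1, 0, 0, 11],
}

propagating = {k: row_avg_propagating_nulls(v) for k, v in kt.items()}
ignoring = {k: row_avg_ignoring_nulls(v) for k, v in kt.items()}
```

For row a the first version gives no value at all, while the second averages the three remaining numbers (23 divided by 3); rows without nulls agree in both.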
Solution 1: q-sql
q)update av:avg flip value kt from kt
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
Solution 2: functional q-sql
tl;dr:
q)![kt;();0b;](1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
Let's start by looking at the parse tree of a non-general solution:
q)parse"update av:avg (c;g;b;e) from kt"
!
`kt
()
0b
(,`av)!,(avg;(enlist;`c;`g;`b;`e))
(note that q is a wrapper implemented in k, so the , prefix operator in the above expression is the same as enlist keyword in q)
So all of the expressions below are equivalent (verify with ~). Relying on projection, (x;y)~(x;)y, we can further improve readability by reducing the distance between parens:
q)k)(!;`kt;();0b;(,`av)!,(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;(enlist`av)!enlist(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;(1#`av)!enlist(avg;(enlist;`c;`g;`b;`e)))
q)(!;`kt;();0b;)(1#`av)!enlist(avg;)(enlist;`c;`g;`b;`e)
let's evaluate the parse tree to check:
q)eval(!;`kt;();0b;)(1#`av)!enlist(avg;)(enlist;`c;`g;`b;`e)
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
(enlist;`c;`g;`b;`e) in the general case is:
q)enlist,cols[`kt]except cols key`kt
enlist
`c
`g
`b
`e
so let's plug in and check:
q)eval(!;`kt;();0b;(1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt)
letter| c g b e av
------| ---------------
a | 11 0 3 9 5.75
b | 11 8 10 0 7.25
c | 7 2 2 3 3.5
d | 8 4 9 6 6.75
e | 0 0 5 0 1.25
f | 1 0 0 11 3
also:
q)![`kt;();0b;(1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt]
q)![ kt;();0b;](1#`av)!enlist(avg;)enlist,cols[`kt]except cols key`kt

Create a Boolean column displaying comparison between 2 other columns in kdb+

I'm currently learning kdb+/q.
I have a table of data. I want to take 2 columns of data (just numbers), compare them and create a new Boolean column that will display whether the value in column 1 is greater than or equal to the value in column 2.
I am comfortable using the update command to create a new column, but I don't know how to ensure that it is Boolean, how to compare the values and a method to display the "greater-than-or-equal-to-ness" - is it possible to do a simple Y/N output for that?
Thanks.
/ dummy data
q) show t:([] a:1 2 3; b: 0 2 4)
a b
---
1 0
2 2
3 4
/ add column name 'ge' with value from b>=a
q) update ge:b>=a from t
a b ge
------
1 0 0
2 2 1
3 4 1
Use a vector conditional:
http://code.kx.com/q/ref/lists/#vector-conditional
q)t:([]c1:1 10 7 5 9;c2:8 5 3 4 9)
q)r:update goe:?[c1>=c2;1b;0b] from t
c1 c2 goe
-------------
1 8 0
10 5 1
7 3 1
5 4 1
9 9 1
Use meta to confirm the goe column is of boolean type:
q)meta r
c | t f a
-------| -----
c1 | j
c2 | j
goe | b
Comparison operators such as <= work element-wise on vectors, but when a function needs atoms as input you may want to use ' (the each-both iterator).
E.g. to compare a column value with the length of a symbol's string:
q)f:{x<=count string y}
q)f[3;`ab]
0b
q)t:([] l:1 2 3; s: `a`bc`de)
q)update r:f'[l;s] from t
l s r
------
1 a 1
2 bc 1
3 de 0
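As a loose analogy in Python (illustrative only; the names below simply mirror the q example), each-both corresponds to applying a two-argument function pairwise across two equal-length columns:

```python
def f(x, y):
    """Is x less than or equal to the length of the string y?"""
    return x <= len(y)


l_col = [1, 2, 3]
s_col = ["a", "bc", "de"]

# Pair the columns element by element and call f on each pair.
r = [f(x, y) for x, y in zip(l_col, s_col)]
```

The resulting r matches the r column above: true, true, false.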

Stata merge with multiple match variables

I am having difficulty combining datasets for a project. Our primary dataset is organized by individual judges. It is an attribute dataset.
judge
j | x | y | z
----|----|----|----
1 | 2 | 3 | 4
2 | 5 | 6 | 7
The second dataset is a case database. Each observation is a case and judges can appear in one of three variables.
case
case | j1 | j2 | j3 | year
-----|----|----|----|-----
1 | 1 | 2 | 3 | 2002
2 | 2 | 3 | 1 | 1997
We would like to merge the case database into the attribute database, matching by judge. So, for each case that a judge appears in j1, j2, or j3, an observation for that case would be added creating a dataset that looks like below.
combined
j | x | y | z | case | year
---|----|----|----|-------|--------
1 | 2 | 3 | 4 | 1 | 2002
1 | 2 | 3 | 4 | 2 | 1997
2 | 5 | 6 | 7 | 1 | 2002
2 | 5 | 6 | 7 | 2 | 1997
My best guess is to use
rename j1 j
merge 1:m j using case
rename j j1
rename j2 j
merge 1:m j using case
However, I am unsure that this will work, especially since the merging dataset has three possible variables that the j identification can occur in.
Your examples are clear, but it would be even better to present them as code that does not require editing to remove the scaffolding. See dataex from SSC (ssc inst dataex).
It's a case of the missing reshape, I think.
clear
input j x y z
1 2 3 4
2 5 6 7
end
save judge
clear
input case j1 j2 j3 year
1 1 2 3 2002
2 2 3 1 1997
end
reshape long j , i(case) j(which)
merge m:1 j using judge
list
+-------------------------------------------------------+
| case which j year x y z _merge |
|-------------------------------------------------------|
1. | 1 1 1 2002 2 3 4 matched (3) |
2. | 2 3 1 1997 2 3 4 matched (3) |
3. | 2 1 2 1997 5 6 7 matched (3) |
4. | 1 2 2 2002 5 6 7 matched (3) |
5. | 2 2 3 1997 . . . master only (1) |
|-------------------------------------------------------|
6. | 1 3 3 2002 . . . master only (1) |
+-------------------------------------------------------+
drop if _merge < 3
list
+---------------------------------------------------+
| case which j year x y z _merge |
|---------------------------------------------------|
1. | 1 1 1 2002 2 3 4 matched (3) |
2. | 2 3 1 1997 2 3 4 matched (3) |
3. | 2 1 2 1997 5 6 7 matched (3) |
4. | 1 2 2 2002 5 6 7 matched (3) |
+---------------------------------------------------+
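The same two steps, reshape to long and then a many-to-one match on j, can be sketched in Python for comparison. This is an illustration under assumptions: the data structures and variable names are made up for the example and are not Stata constructs.

```python
# Judge attributes keyed by judge id (the "using" dataset).
judges = {1: {"x": 2, "y": 3, "z": 4},
          2: {"x": 5, "y": 6, "z": 7}}

# Case data in wide form, one row per case.
cases = [{"case": 1, "j1": 1, "j2": 2, "j3": 3, "year": 2002},
         {"case": 2, "j1": 2, "j2": 3, "j3": 1, "year": 1997}]

# Step 1: reshape long, one row per (case, judge slot).
long_rows = [
    {"case": c["case"], "which": w, "j": c[f"j{w}"], "year": c["year"]}
    for c in cases
    for w in (1, 2, 3)
]

# Step 2: many-to-one match on j, keeping matched rows only
# (the equivalent of dropping _merge < 3).
combined = sorted(
    ({**row, **judges[row["j"]]} for row in long_rows if row["j"] in judges),
    key=lambda r: (r["j"], r["case"]),
)
```

Judge 3 has no attribute record, so its two long rows drop out, leaving the four matched observations.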

Reduce matrix by adding every n

Not sure how to explain it so here it goes as an example:
A=[1 0 0 1 4 4 4 4
0 0 0 0 2 3 2 2
0 0 0 0 0 0 0 1
2 3 4 5 2 3 4 1 ]
result:
b=[ 1 1 13 12
5 9 5 6];
Each element is computed by summing an N-by-N submatrix of the original, in this case N=2.
so b(1,1) is A(1,1)+A(1,2)+A(2,1)+A(2,2), and b(1,4) is A(1,7)+A(2,7)+A(1,8)+A(2,8).
Visually and more clearly:
A=[|1 0| 0 1| 4 4| 4 4|
|0 0| 0 0| 2 3| 2 2|
____________________
|0 0| 0 0| 0 0| 0 1|
|2 3| 4 5| 2 3| 4 1| ]
b is the sum of the elements on those squares, in this example of size 2.
I can imagine how to do it with loops, but it just feels vectorizable. Any ideas on how it could be done?
Assume that the dimensions of matrix A are multiples of N.
If you have the Image processing toolbox blockproc could be an option as well:
B = blockproc(A,[2 2],@(x) sum(x.data(:)))
Method 1:
Using mat2cell and cellfun
n = 2;
AC = mat2cell(A,repmat(n,size(A,1)/n,1),repmat(n,size(A,2)/n,1));
out = cellfun(@(x) sum(x(:)), AC)
Method 2:
Using permute and reshape
n = 2;
[rows,cols] = size(A);
out = reshape(sum(sum(permute(reshape(A,n,rows/n,n,[]),[1 3 2 4]))),rows/n,[]);
PS: Here is a close question related to this one, which you might find useful. That question is to find mean while this one is to find sum.
Here are two alternative methods:
Method #1 - im2col
Another method using the image processing toolbox is to use im2col with the distinct flag and sum over all of the resulting columns. You would then need to reshape the matrix back to the right size:
n = 2;
B = im2col(A, [n n], 'distinct');
C = reshape(sum(B, 1), size(A,1)/n, size(A,2)/n);
We get for C:
>> C
C =
1 1 13 12
5 9 5 6
Method #2 - accumarray and kron
We can generate an index matrix with kron which we can use as bins into accumarray and invoke sum as the custom function. We would again have to reshape the matrix back to the right size:
n = 2;
M = reshape(1:prod([size(A,1)/n, size(A,2)/n]), size(A,1)/n, size(A,2)/n);
ind = kron(M, ones(n));
C = reshape(accumarray(ind(:), A(:), [], @sum), size(A,1)/n, size(A,2)/n);
Again we get for C:
C =
1 1 13 12
5 9 5 6
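The bin-index idea behind the accumarray approach can also be sketched in plain Python, shown here only as an illustration (the function name block_sum is hypothetical): each element is added into the output cell selected by integer-dividing its row and column index by n.

```python
def block_sum(a, n):
    """Sum each n-by-n block of a; assumes both dimensions are multiples of n."""
    rows, cols = len(a), len(a[0])
    out = [[0] * (cols // n) for _ in range(rows // n)]
    for i in range(rows):
        for j in range(cols):
            # (i // n, j // n) is the block that element (i, j) belongs to.
            out[i // n][j // n] += a[i][j]
    return out


A = [[1, 0, 0, 1, 4, 4, 4, 4],
     [0, 0, 0, 0, 2, 3, 2, 2],
     [0, 0, 0, 0, 0, 0, 0, 1],
     [2, 3, 4, 5, 2, 3, 4, 1]]

b = block_sum(A, 2)
```

For the example matrix this reproduces the expected b with rows 1 1 13 12 and 5 9 5 6.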