Conditional Group By on Selected rows in KDB/Q-Sql - kdb

I have a requirement where I have to execute multiple queries and perform group by on a column with where clause , group by column is fixed and where condition will be perform on fixed column with variable criteria . Only Column name and aggregation type will be varies
For example if I have table :
k1 k2 val1 val2
1 1 10 30
1 1 20 31
1 2 30 32
2 2 40 33
2 3 50 34
2 4 60 35
2 4 70 36
3 4 80 37
3 5 90 38
3 5 100 39
t:([] k1:1 1 1 2 2 2 2 3 3 3; k2:1 1 2 2 3 4 4 4 5 5; val1:10 20 30 40 50 60 70 80 90 100; val2:31 31 32 33 34 35 36 37 38 39)
Queries which I need to perform will be like
select avg_val1:avg val1 by k1 from t where k2 in 2 3 4
select sum_val1:sum val1 by k1 from t where k2 in 2 3
select sum_val2:sum val2 by k1 from t where k2 in 2 3 5
select min_val2:min val2 by k1 from t where k2 in 1 2 3 4 5
I want to execute these queries in a single execution using functional queries. I tried this, but not able to put right condition and syntax
res:?[t;();(enlist`k1)!enlist`k1;(`avg_val1;`sum_val2)!({$[x; y; (::)]}[1b;(avg;`val1)];{$[x; y; (::)]}[1b; (sum;`val2)])];
k1 avg_val1 sum_val2
1 20.0 94
2 55.0 138
3 90.0 114
Instead putting 1b in condition , i want to put real condition like this:
res:?[t;();(enlist`k1)!enlist`k1;(`avg_val1;`sum_val2)!({$[x; y; (::)]}[(in;`k2;2 3 4i);(avg;`val1)];{$[x; y; (::)]}[(in;`k2;2 3i); (sum;`val2)])];
But it will give "type" error, since query will be first group by k1 ,and k2 will be list. So condition is also not right.
I want to know what can be the best solution for this.
May be there can be better approach to solve the same .
Please help me to in same.
Thank you.

The vector conditional (?) operator can get you closer to what you'd like.
Given your table
t:([] k1:1 1 1 2 2 2 2 3 3 3; k2:1 1 2 2 3 4 4 4 5 5; val1:10 20 30 40 50 60 70 80 90 100; val2:31 31 32 33 34 35 36 37 38 39)
k1 k2 val1 val2
---------------
1 1 10 31
1 1 20 31
1 2 30 32
2 2 40 33
2 3 50 34
2 4 60 35
2 4 70 36
3 4 80 37
3 5 90 38
3 5 100 39
you can update, say, the val1 column to hold null values wherever a condition does not hold
update val1:?[k2 in 2 3 4;val1;0N] from t
k1 k2 val1 val2
---------------
1 1 31
1 1 31
1 2 30 32
2 2 40 33
2 3 50 34
2 4 60 35
2 4 70 36
3 4 80 37
3 5 38
3 5 39
and with a little more work you can get the desired aggregate (NB: the aggregate functions ignore null values)
select avg ?[k2 in 2 3 4;val1;0N] by k1 from t
k1| x
--| --
1 | 30
2 | 55
3 | 80
You can wrap this up into a functional select statement like so
?[t;();{x!x}enlist`k1;`avg_val1`sum_val2!((avg;(?;(in;`k2;2 3 4);`val1;0N));(sum;(?;(in;`k2;2 3);`val2;0N)))]
k1| avg_val1 sum_val2
--| -----------------
1 | 30 32
2 | 55 67
3 | 80 0
However, this can break when you use an function that does not ignore nulls, e.g. count. You may be better off using the where operator in you select statement:
select avg val1 where k2 in 2 3 4 by k1 from t
k1| x
--| --
1 | 30
2 | 55
3 | 80
?[t;();{x!x}enlist`k1;`avg_val1`sum_val2!((avg;(`val1;(where;(in;`k2;2 3 4))));(sum;(`val2;(where;(in;`k2;2 3)))))]
k1| avg_val1 sum_val2
--| -----------------
1 | 30 32
2 | 55 67
3 | 80 0

Related

Pyspark dataframe conditional filter and imputation

I have a pyspark dataframe df
ID
Total_Count
A
B
C
D
Group
Name
Chain
1
56
0
0
0
0
1
Apple
Fruits1
2
65
0
0
0
0
1
Apple
Fruits1
3
72
0
0
30
0
1
Banana
Fruits1
4
80
0
0
0
0
1
Strawberry
Fruits1
5
142
58
58
14
12
1
Apple
Fruits1
6
130
63
50
9
8
1
Apple
Fruits1
7
145
74
44
17
10
1
Apple
Fruits1
8
119
54
48
8
9
1
Apple
Fruits1
11
161
71
63
16
11
1
Banana
Fruits1
12
124
54
43
19
8
1
Banana
Fruits1
I want to impute the A,B,C,D columns wherever there is 0 in A,B,C,D columns(ID 1,2,3,4).
1.) Logic : Average of GroupxName(if available) or Average of GroupxChain(if available) or at Average of Group :
Taking the example to impute ID 1,2 for demo:
Post filering for Group 1 and Name Apple, Proportion for ID 1&2 is obtained as follows( For ID 1 and 2 resp. filtering rows with similar Group as 1 and similar Name (Apple)) ,proportion is calculated as A/Total_Count, B/Total_Count and so on :
A_PROP
B_PROP
C_PROP
D_PROP
0.408451
0.408450704
0.098592
0.084507042
0.484615
0.384615385
0.069231
0.061538462
0.510345
0.303448276
0.117241
0.068965517
0.453782
0.403361345
0.067227
0.075630252
2.) Average of the above 4 rows is to be taken (for ID 1 & 2 for example).
A,B,C,D in df2 is calcualted as X_prop_avg*Total_Count.
Expected output (df2) :
ID
Total_Count
A_prop_avg
B_prop_avg
C_prop_avg
D_prop_avg
A
B
C
D
1
56
0.46429811
0.37496893
0.08807265
0.07266032
26
21
5
4
2
65
0.464298107
0.374968927
0.088072647
0.072660318
30
24
6
5
3
72
0.43823883
0.369039271
0.126302344
0.066419555
32
27
9
5
4
80
0.455611681
0.372992375
0.10081588
0.070580064
36
30
8
6

kdb - KDB Apply logic where column exists - data validation

I'm trying to perform some simple logic on a table but I'd like to verify that the columns exists prior to doing so as a validation step. My data consists of standard table names though they are not always present in each data source.
While the following seems to work (just validating AAA at present) I need to expand to ensure that PRI_AAA (and eventually many other variables) is present as well.
t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]
Two part question
This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to see if a number of variables exists or insert a place holder column of zeros if they do not?
Similarly, can we store a formula or instructions to to apply within a dictionary (or table) to validate and return a calculation (i.e. BBB_VAL: BBB*BBB_PRICE.) Some calculations would be dependent on others (i.e. BBB_Tax_Basis = BBB_VAL - BBB_COSTS costs for example so there could be iterative issues.
Thank in advance!
A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.
func:{[t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
t;
];
};
This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.
q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0 0
39 17 97 50 90 40 10 1950 1530 3880 0 0
76 11 11 0 0 50 10 0 0 550 0 0
26 55 99 20 60 80 90 520 3300 7920 0 0
91 51 3 30 20 0 60 2730 1020 0 0 0
83 81 7 70 60 40 90 5810 4860 280 0 0
76 68 98 40 80 90 70 3040 5440 8820 0 0
88 96 30 70 0 80 80 6160 0 2400 0 0
4 61 2 70 90 0 40 280 5490 0 0 0
56 70 15 0 50 30 30 0 3500 450 0 0
As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.
q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE
And then a slightly modified function...
func:{[dict;t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
t;
];
};
yields a similar result.
q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0
39 17 97 50 90 40 10 1950 1530 0 0
76 11 11 0 0 50 10 0 0 0 0
26 55 99 20 60 80 90 520 3300 0 0
91 51 3 30 20 0 60 2730 1020 0 0
83 81 7 70 60 40 90 5810 4860 0 0
76 68 98 40 80 90 70 3040 5440 0 0
88 96 30 70 0 80 80 6160 0 0 0
4 61 2 70 90 0 40 280 5490 0 0
56 70 15 0 50 30 30 0 3500 0 0
Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.
q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL | (*;`AAA;`AAA_PRICE)
BBB_VAL | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other | (%:;`BBB_VAL)
func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1 1 10 4 1 11 1.1e-05
2 2 20 5 4 24 2.4e-05
3 3 30 6 9 39 3.9e-05
/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
--------------------------------------------------------------------------------------------
1 1 10 4 4 1 16 11 1.1e-05 0.0625
2 2 20 5 5 4 25 24 2.4e-05 0.04
3 3 30 6 6 9 36 39 3.9e-05 0.02777778
The only caveat is that your map can't have the same column name as both the key and in the value of your map, aka cannot re-use column names. And it's assumed all symbols in your map are column names (not global variables) though it could be extended to cover that
EDIT: if you have a large number of column maps then it will be easier to define it in a more vertical fashion like so:
map:(!). flip(
(`AAA_VAL; (*;`AAA;`AAA_PRICE));
(`BBB_VAL; (*;`BBB;`BBB_PRICE));
(`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
(`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
(`BBB_Other; (reciprocal;`BBB_VAL))
);

(q/kdb+) Generating an automated list

Example 1)
I have the code below
5#10+1*2
that generates
index value
0 12
1 12
2 12
3 12
4 12
How can I replace the number "1" by the index?
then generating
5#10+index*2
index value
0 10
1 12
2 14
3 16
4 18
update Example 2)
Now, if I have, let's say
mult:5;
t:select from ([]numC:1 3 6 4 1;[]s:50 16 53 6 33);
update lst:(numC#'s) from t
the last update will generate
numC s lst
1 50 50
3 16 16 16 16
6 53 53 53 53 53 53 53
4 6 6 6 6 6
1 33 33
How can I generate the "lst" column as per below?
numC s lst
1 50 50+0*mult
3 16 16+0*mult 16+1*mult 16+2*mult
6 53 53+0*mult 53+1*mult 53+2*mult 53+3*mult 53+4*mult 53+5*mult
4 6 6+0*mult 6+1*mult 6+2*mult 6+3*mult
1 33 33+0*mult
I tried something like
update lst:(numC#'s + (til numC)*mult) from t
but I am getting an error
ERROR: 'type
Thanks vm
Is this what you're looking for:
q)x:5
q)x#10+(til x)*2
10 12 14 16 18
http://code.kx.com/q/ref/arith-integer/#til
You can remove take # and use til to simplify to:
q)10+2*til 5
10 12 14 16 18
Using til will create a list of a list of 5 elements (0->4), so you will not need take 5 elements from the resulting list. Take will only be required if your list of indices is greater than 5.
Update:
For your second example the following should work:
q)update lst:{y+x*til z}'[mult;s;numC] from t
q)update lst:s+mult*til each numC from t
numC s lst
-------------------------
1 50 ,50
3 16 16 21 26
6 53 53 58 63 68 73 78
4 6 6 11 16 21
1 33 ,33
There are many ways with which we can get achieve this:
1) 10+2*til 5
2) (2*til 5) + 10
/ take operator: The dyadic take function creates lists. The left argument specifies the count and shape and the right argument provides the data.
It is useful for selecting from the front or end of a list.
https://code.kx.com/wiki/Reference/NumberSign
q)5#0 1 2 3 4 5 6 7 8 / take the first 5 items
0 1 2 3 4
q)-5#0 1 2 3 4 5 6 7 8 / take the last 5 elements
4 5 6 7 8
use take operator # only when it is required.
say we have 10 elements, of which we need five on output, then we can use:
5#10+2*til 10
/ The til function takes a non-negative integer argument X and returns the first X integers

Matlab: how I can transform this algorithm associated with matrices manipulation?

(For my problem, I use a matrix A 4x500000. And the values of A(4,k) varies between 1 and 200).
I give here an example for a case A 4x16 and A(4,k) varies between 1 and 10.
I want first to match a name to the value from 1 to 5 (=10/2):
1 = XXY;
2 = ABC;
3 = EFG;
4 = TXG;
5 = ZPF;
My goal is to find,for a vector X, a matrix M from the matrix A:
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
A(4,k) takes all values between 1 and 10. These values can be repeated and they all appear on the 4th line.
20
X= 32 =A(1:3,1)=A(1:3,6)=A(1:3,8)=A(1:3,9)=A(1:3,12)=A(1:3,16)
40
A(4,1) = 2;
A(4,6) = 5;
A(4,8) = 1;
A(4,9) = 3;
A(4,12) = 6;
A(4,16) = 10;
for A(4,k) corresponding to X, I associate 2 if A(4,k)<= 5, and 1 if A(4,k)> 5. For the rest of the value of A(4,k) which do not correspond to X, I associate 0:
[ 1 2 3 4 5 %% value of the fourth line of A between 1 and 5
2 2 2 0 2
ZX = 6 7 8 9 10 %% value of the fourth line of A between 6 and 10
1 0 0 0 1
2 2 2 0 2 ] %% = max(ZX(2,k),ZX(4,k))
the ultimate goal is to find the matrix M:
M = [ 1 2 3 4 5
XXY ABC EFG TXG ZPF
2 2 2 0 2 ] %% M(3,:)=ZX(5,:)
Code -
%// Assuming A, X and names to be given to the solution
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10];
X = [20 ; 32 ; 40];
names = {'XXY','ABC','EFG','TXG','ZPF'};
limit = 10; %// The maximum limit of A(4,:). Edit this to 200 for your actual case
%// Find matching 4th row elements
matches = A(4,ismember(A(1:3,:)',X','rows'));
%// Matches are compared against all possible numbers between 1 and limit
matches_pos = ismember(1:limit,matches);
%// Finally get the line 3 results of M
vals = max(2*matches_pos(1:limit/2),matches_pos( (limit/2)+1:end ));
Output -
vals =
2 2 2 0 2
For a better way to present the results, you can use a struct -
M_struct = cell2struct(num2cell(vals),names,2)
Output -
M_struct =
XXY: 2
ABC: 2
EFG: 2
TXG: 0
ZPF: 2
For writing the results to a text file -
output_file = 'results.txt'; %// Edit if needed to be saved to a different path
fid = fopen(output_file, 'w+');
for ii=1:numel(names)
fprintf(fid, '%d %s %d\n',ii, names{ii},vals(ii));
end
fclose(fid);
Text contents of the text file would be -
1 XXY 2
2 ABC 2
3 EFG 2
4 TXG 0
5 ZPF 2
A bsxfun() based approach.
Suppose your inputs are (where N can be set to 200):
A = [20 52 70 20 52 20 52 20 20 10 52 20 11 1 52 20
32 24 91 44 60 32 24 32 32 12 11 32 2 5 24 32
40 37 24 30 11 40 37 40 40 5 10 40 40 3 37 40
2 4 1 3 4 5 2 1 3 3 8 6 7 9 6 10]
X = [20; 32; 40]
N = 10;
% Match first 3 rows and return 4th
idxA = all(bsxfun(#eq, X, A(1:3,:)));
Amatch = A(4,idxA);
% Match [1:5; 5:10] to 4th row
idxZX = ismember([1:N/2; N/2+1:N], Amatch)
idxZX =
1 1 1 0 1
1 0 0 0 1
% Return M3
M3 = max(bsxfun(#times, idxZX, [2;1]))
M3 =
2 2 2 0 2

Matrix division & permutation to achieve Baker map

I'm trying to implement the Baker map.
Is there a function that would allow one to divide a 8 x 8 matrix by providing, for example, a sequence of divisors 2, 4, 2 and rearranging pixels in the order as shown in the matrices below?
X = reshape(1:64,8,8);
After applying divisors 2,4,2 to the matrix X one should get a matrix like A shown below.
A=[31 23 15 7 32 24 16 8;
63 55 47 39 64 56 48 40;
11 3 12 4 13 5 14 6;
27 19 28 20 29 21 30 22;
43 35 44 36 45 37 46 38;
59 51 60 52 61 53 62 54;
25 17 9 1 26 18 10 2;
57 49 41 33 58 50 42 34]
The link to the document which I am working on is:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.5132&rep=rep1&type=pdf
This is what I want to achieve:
Edit: a little more generic solution:
%function Z = bakermap(X,divisors)
function Z = bakermap()
X = reshape(1:64,8,8)'
divisors = [ 2 4 2 ];
[x,y] = size(X);
offsets = sum(divisors)-fliplr(cumsum(fliplr(divisors)));
if any(mod(y,divisors)) && ~(sum(divisors) == y)
disp('invalid divisor vector')
return
end
blocks = #(div) cell2mat( cellfun(#mtimes, repmat({ones(x/div,div)},div,1),...
num2cell(1:div)',...
'UniformOutput',false) );
%create index matrix
I = [];
for ii = 1:numel(divisors);
I = [I, blocks(divisors(ii))+offsets(ii)];
end
%create Baker map
Y = flipud(X);
Z = [];
for jj=1:I(end)
Z = [Z; Y(I==jj)'];
end
Z = flipud(Z);
end
returns:
index matrix:
I =
1 1 3 3 3 3 7 7
1 1 3 3 3 3 7 7
1 1 4 4 4 4 7 7
1 1 4 4 4 4 7 7
2 2 5 5 5 5 8 8
2 2 5 5 5 5 8 8
2 2 6 6 6 6 8 8
2 2 6 6 6 6 8 8
Baker map:
Z =
31 23 15 7 32 24 16 8
63 55 47 39 64 56 48 40
11 3 12 4 13 5 14 6
27 19 28 20 29 21 30 22
43 35 44 36 45 37 46 38
59 51 60 52 61 53 62 54
25 17 9 1 26 18 10 2
57 49 41 33 58 50 42 34
But have a look at the if-condition, it's just possible for these cases. I don't know if that's enough. I also tried something like divisors = [ 1 4 1 2 ] - and it worked. As long as the sum of all divisors is equal the row-length and the modulus as well, there shouldn't be problems.
Explanation:
% definition of anonymous function with input parameter: div: divisor vector
blocks = #(div) cell2mat( ... % converts final result into matrix
cellfun(#mtimes, ... % multiplies the next two inputs A,B
repmat(... % A...
{ones(x/div,div)},... % cell with a matrix of ones in size
of one subblock, e.g. [1,1,1,1;1,1,1,1]
div,1),... % which is replicated div-times according
to actual by cellfun processed divisor
num2cell(1:div)',... % creates a vector [1,2,3,4...] according
to the number of divisors, so so finally
every Block A gets an increasing factor
'UniformOutput',false...% necessary additional property of cellfun
));
Have also a look at this revision to have a simpler insight in what is happening. You requested a generic solution, thats the one above, the one linked was with more manual inputs.