q - apply function on table rowwise - kdb

Given a table and a function
t:([] c1:1 2 3; c2:`a`b`c; c3:13:00 13:01 13:02)
f:{[int;sym;date]
symf:{$[x=`a;1;x=`b;2;3]};
datef:{$[x=13:00;1;x=13:01;2;3]};
r:int + symf[sym] + datef[date];
r
};
I noticed that when applying the function f to columns of t, the entire columns are passed into f; if they can be operated on atomically, the output has the same length as the inputs and a new column is produced. However, in our example this won't work:
update newcol:f[c1;c2;c3] from t / 'type error
because the inner functions symf and datef cannot be applied to the entire columns c2 and c3, respectively.
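The failure can be reproduced on the inner function alone (an illustrative check, not part of the original question): Cond ($) requires an atom condition, but passed a whole column the condition becomes a boolean vector.

symf t`c2 / signals an error: x=`a is a boolean vector here, not an atom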
If I don't want to change the function f at all, how can I apply it row by row and collect the values into a new column in t?
What's the most q-style way to do this?
EDIT
If not changing f proves too inconvenient, one could work around it like so:
f:{[arglist]
int:arglist 0;
sym:arglist 1;
date:arglist 2;
symf:{$[x=`a;1;x=`b;2;3]};
datef:{$[x=13:00;1;x=13:01;2;3]};
r:int + symf[sym] + datef[date];
r
};
f each (t`c1),'(t`c2),'(t`c3)
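An equivalent way to build the list of rows (not in the original post) is to flip the column list, which avoids the chained each-both joins:

f each flip(t`c1;t`c2;t`c3) / flip turns 3 columns into a list of 3-item rows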
Still, I would be interested in how to get the same result when working with the original version of f.
Thanks!

You can use each-both for this, e.g.
q)update newcol:f'[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
However, you will likely get better performance by modifying f to be "vectorised", e.g.
q)f2
{[int;sym;date]
symf:3^(`a`b!1 2)sym;           / dictionary lookup; ^ fills nulls (unmapped symbols) with 3
datef:3^(13:00 13:01!1 2)date;  / same idea for the times
r:int + symf + datef;
r
}
q)update newcol:f2[c1;c2;c3] from t
c1 c2 c3    newcol
------------------
1  a  13:00 3
2  b  13:01 6
3  c  13:02 9
q)\ts:1000 update newcol:f2[c1;c2;c3] from t
4 1664
q)\ts:1000 update newcol:f'[c1;c2;c3] from t
8 1680
In general in kdb+, if you can avoid any form of each and stick to vector operations, you'll get much better efficiency.
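A quick way to see the difference on your own machine (an illustrative sketch, not from the original answer; timings are machine-dependent):

q)a:1000000?100; b:1000000?100
q)\ts r1:a+b          / one vector operation over the whole columns
q)\ts r2:{x+y}'[a;b]  / each-both: one function call per pair of atoms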

Related

q/KDB - nprev function to get all the previous n elements

I am struggling to write an nprev function in kdb+; the xprev function returns the nth previous element, but I need all of the previous n elements relative to the current element.
q)t:([] i:1+til 26; s:.Q.a)
q)update xp:xprev[3;]s,p:prev s from t
Any help is greatly appreciated.
You can achieve the desired result by applying prev repeatedly and flipping the result
q)n:3
q)select flip 1_prev\[n;s] from t
s
-----
"   "
"a  "
"ba "
"cba"
"dcb"
"edc"
..
If n is much smaller than the row count, this will be faster than some of the more straightforward solutions.
The xprev function basically looks like this:
xprev1:{y til[count y]-x} / readable xprev
We can tweak it to get all n elements
nprev:{y til[count y]-\:1+til x}
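To see what this builds, til[count y]-\:1+til x is a count[y]-by-x matrix of indices into y, and negative (out-of-range) indices resolve to nulls, which display as blanks for a char vector. For example (an illustrative check, not from the original answer):

q)til[5]-\:1+til 3
-1 -2 -3
0  -1 -2
1  0  -1
2  1  0
3  2  1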
using nprev in the query
q)update np: nprev[3;s] , xp1:xprev1[3;s] , xp: xprev[3;s], p:prev[s] from t
i s np    xp1 xp p
------------------
1 a "   "
2 b "a  "        a
3 c "ba "        b
4 d "cba" a   a  c
5 e "dcb" b   b  d
6 f "edc" c   c  e
k equivalent of nprev
k)nprev:{$[0h>#y;'`rank;y(!#y)-\:1+!x]}
and similarly nnext would look like
k)nnext:{$[0h>#y;'`rank;y(!#y)+\:1+!x]}
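Usage mirrors nprev (an illustrative sketch, not from the original answer); the last n rows come back padded with blanks, just as the first n rows of nprev do:

q)update nn:nnext[3;s] from t / row 1 gives "bcd", row 2 "cde", ...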

Table sort by month

I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is convert your month names to numeric values from 1 (January) to 12 (December) as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
This also gives you a standard sorting criterion for ColumnC (in this example, ascending):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason, you are forced to keep your months as literals, you can use a workaround: sort a clone of the table using the approach described above, then apply the resulting indices to the original table:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);
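A possible alternative (not part of the original answer) is to convert the months to an ordinal categorical, which keeps the literals in place while still sorting chronologically:

t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
% Ordinal categorical: sort order follows the category order given here
t.ColumnC = categorical(t.ColumnC, ...
    {'January','February','March','April','May','June','July', ...
     'August','September','October','November','December'}, 'Ordinal', true);
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'});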

How can I sum up functions that are made of elements of the imported dataset?

See the code and error below. I have already tried Do, For, ... and it is not working.
CODE + Error from Mathematica:
Import of survival probabilities _{k}p_x and _{k}p_y (calculated in Excel):
px = Import["C:\Users\Eva\Desktop\kpx.xlsx"];
px = Flatten[Take[px, All], 1];
NOTE: The probability _{k}p_x can be found at position px[[k+2, x-16]].
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] = Sum[v^k*px[[k + 2, x - 16]]*py[[k + 2, y - 16]], {k , 0, n - 1}]
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
General::stop: Further output of Part::pkspec1 will be suppressed during this calculation.
Part of the dataset (top-left corner):
k\x 18 19 20
0 1 1 1
1 0.999478086278185 0.999363078716059 0.99927911905056
2 0.998841497412202 0.998642656911039 0.99858030519133
3 0.998121451605207 0.99794428814123 0.99788275311401
4 0.997423447323642 0.997247180349674 0.997174407432264
5 0.996726703362208 0.996539285828369 0.996437857252448
6 0.996019178300768 0.995803204773039 0.99563600297737
7 0.995283481416241 0.995001861216016 0.994823584922968
8 0.994482556091416 0.994189960607964 0.99405569519175
9 0.993671079225432 0.99342255996206 0.993339856748282
10 0.992904079096455 0.992707177451333 0.992611817294026
11 0.992189069953677 0.9919796017009 0.991832027835091
Without having the exact same data files to work with it is often easy for each of us to make mistakes that the other cannot reproduce or understand.
From your snapshot of your data set I used Export in Mathematica to try to reproduce your .xlsx file. Then I tried the following
px = Import["kpx.xlsx"];
px = Flatten[Take[px, All], 1];
py = px; (* fake some py data *)
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] := Sum[v^k*px[[k+2,x-16]]*py[[k+2,y-16]], {k,0,n-1}];
JointLifeIndep[17, 17, 12]
and it displays 362.402
Notice I used := instead of = in my definition of JointLifeIndep. := and = do different things in Mathematica: = (Set) immediately evaluates the right-hand side of the definition, while := (SetDelayed) re-evaluates it each time the function is called. This is possibly the reason that you are getting the error that you do.
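A minimal illustration of the difference (my example, not from the original exchange):

f1[x_] = RandomReal[];  (* Set: RHS evaluated once; every call returns the same number *)
f2[x_] := RandomReal[]; (* SetDelayed: RHS re-evaluated; each call returns a fresh number *)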
You should also be careful with your subscript values and make sure that every subscript is between 1 and the number of rows (or columns) in your matrix.
So see if you can try this example with an Excel sheet containing only the snapshot of data that you showed and see if you get the same result that I do.
Hopefully that will be enough for you to make progress.

How to use GroupByKey in Spark to calculate nonlinear-groupBy task

I have a table that looks like this:
Time ID Value1 Value2
1    a  1      4
2    a  2      3
3    a  5      9
1    b  6      2
2    b  4      2
3    b  9      1
4    b  2      5
1    c  4      7
2    c  2      0
Here are the tasks and requirements:
I want to set the column ID as the key, not the column Time, but I don't want to delete the column Time. Is there a way in Spark to set a primary key?
The aggregation function is non-linear, which means you cannot use reduceByKey. All the data must be shuffled to a single node before calculation. For example, the aggregation function may look like the Nth root of the summed values, where N is the number of records (the count) for each ID:
output = root(sum(value1), count(*)) + root(sum(value2), count(*))
To make it clear, for ID="a", the aggregated output value should be
output = root(1 + 2 + 5, 3) + root(4 + 3 + 9, 3)
the latter 3 is because we have 3 records for 'a'. For ID='b', it is:
output = root(6 + 4 + 9 + 2, 4) + root(2 + 2 + 1 + 5, 4)
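Numerically (worked out here for illustration, not in the original question), that gives 8^(1/3) + 16^(1/3) = 2 + 2.52 ≈ 4.52 for 'a', and 21^(1/4) + 10^(1/4) ≈ 2.14 + 1.78 ≈ 3.92 for 'b'.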
The combination is non-linear; therefore, in order to get correct results, all the data with the same "ID" must be on one executor.
I checked UDF and Aggregator in Spark 2.0. Based on my understanding, they all assume a "linear combination".
Is there a way to handle such a non-linear combination calculation, especially taking advantage of parallel computing with Spark?
The function you use doesn't require any special treatment. You can use plain SQL with a join:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, lit, sum, pow}
def root(l: Column, r: Column) = pow(l, lit(1) / r)
val out = root(sum($"value1"), count("*")) + root(sum($"value2"), count("*"))
df.groupBy("id").agg(out.alias("outcome")).join(df, Seq("id"))
or window functions:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("id")
val outw = root(sum($"value1").over(w), count("*").over(w)) +
root(sum($"value2").over(w), count("*").over(w))
df.withColumn("outcome", outw)
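For reference, a self-contained way to try either version on the sample data (the DataFrame construction below is assumed, not from the original question):

import spark.implicits._

val df = Seq(
  (1, "a", 1, 4), (2, "a", 2, 3), (3, "a", 5, 9),
  (1, "b", 6, 2), (2, "b", 4, 2), (3, "b", 9, 1), (4, "b", 2, 5),
  (1, "c", 4, 7), (2, "c", 2, 0)
).toDF("time", "id", "value1", "value2")

// For id "a": 8^(1/3) + 16^(1/3) ≈ 4.52, matching the hand calculation above
df.withColumn("outcome", outw).show()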

Crystal Reports Cross-tab with mix of Sum, Percentages and Computed values

Being new to Crystal, I am unable to figure out how to compute rows 3 and 4 below.
Rows 1 and 2 are simple percentages of the sum of the data.
Row 3 is a computed value (see below.)
Row 4 is a sum of the data points (NOT a percentage as in row 1 and row 2)
Can someone give me some pointers on how to generate the display below?
My data:
2010/01/01 A 10
2010/01/01 B 20
2010/01/01 C 30
2010/02/01 A 40
2010/02/01 B 50
2010/02/01 C 60
2010/03/01 A 70
2010/03/01 B 80
2010/03/01 C 90
I want to display
                      2010/01/01  2010/02/01  2010/03/01
                      ==========  ==========  ==========
[ B/(A + B + C) ]     20/60       50/150      80/240      <=== percentage of sum
[ C/(A + B + C) ]     30/60       60/150      90/240      <=== percentage of sum
[ 1 - A/(A + B + C) ] 1 - 10/60   1 - 40/150  1 - 70/240  <=== computed
[ (A + B + C) ]       60          150         240         <=== sum
Assuming you are using a SQL data source, I suggest deriving each output row's values (i.e. [B/(A + B + C)], [C/(A + B + C)], [1 - A/(A + B + C)] and [(A + B + C)]) per date in the SQL query, then using Crystal's cross-tab feature to pivot them into the desired output format.
Crystal's cross-tabs aren't particularly suited to deriving different calculations on different rows of output.
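A minimal sketch of such a query (table and column names here are assumptions, not from the original question; the * 1.0 guards against integer division):

SELECT obs_date,
       SUM(CASE WHEN grp = 'B' THEN val ELSE 0 END) * 1.0 / SUM(val) AS b_share,
       SUM(CASE WHEN grp = 'C' THEN val ELSE 0 END) * 1.0 / SUM(val) AS c_share,
       1 - SUM(CASE WHEN grp = 'A' THEN val ELSE 0 END) * 1.0 / SUM(val) AS computed,
       SUM(val) AS total
FROM mydata
GROUP BY obs_date;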