KDB - Automatic function argument behavior with Iterators - kdb

I'm struggling to understand the behavior of the arguments in the scan function below. I understand the EWMA calculation and have built an Excel worksheet that matches it, but the kdb syntax is throwing me off in terms of what x, y and z are (and when). I've consulted Q for Mortals, other books and https://code.kx.com/q/ref/over/, and I do understand what's going on in the simpler examples provided.
I understand the EWMA formula based on the Excel calc, but how is that translated into the function below?
x = constant, y = passed-in values (but also appears to be the prior result?) and z = (previous period?)
ewma: {{(y*1-x)+(z*x)} [x]\[y]};
ewma [.25; 15 20 25 30 35f]
15 16.25 18.4375 21.32813 24.74609
Rearranging terms makes it easier to read, but if I were to write this in Excel, I would incorrectly reference the y value column in the addition operator instead of correctly referencing the previous EWMA value.
ewma: {{y+x*z-y} [x]\[y]};
ewma [.25; 15 20 25 30 35f]
15 16.25 18.4375 21.32813 24.74609
EWMA in Excel formula for auditing

0N! is useful in these cases for seeing which values are passed. Simply add it to the start of the function to display a variable in the console. E.g. to show what y is being passed in on each run:
q)ewma: {{0N!y;(y*1-x)+(z*x)} [x]\[y]};
q)ewma [.25; 15 20 25 30 35f]
15f
16.25
18.4375
21.32812
//Or multiple at once
q)ewma: {{0N!(x;y;z);(y*1-x)+(z*x)} [x]\[y]};
q)
q)ewma [.25; 15 20 25 30 35f]
0.25 15 20
0.25 16.25 25
0.25 18.4375 30
0.25 21.32812 35
Edit:
To see why z is holding the values of the outer y list, it is best to look at the simplified example below, which uses just x and y.
//two parameters specified in beginning.
//x initialised as 1 then takes the function result for next run
//y takes value of next value in list
q){0N!(x;y);x+y}\[1;2 3 4]
1 2
3 3
6 4
3 6 10
//in this example only one parameter is passed
//but q takes first value in list as x in this special case
q){0N!(x;y);x+y}\[1 2 3 4]
1 2
3 3
6 4
1 3 6 10
Something similar is happening in your example. x is fixed by the projection [x] rather than supplied by the iterator, so it keeps the same value on every run.
As in the simplified example above, the inner function's y is initialised with the first value of the outer y list (15f in this case), and z takes the second value of the list on the first run. After that, y takes the result of the previous run and z takes the next value in the list, until the whole list has been passed to the function.
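For readers more comfortable outside q, the same argument flow can be sketched in Python. This is only an analogy, not how q implements scan: itertools.accumulate, like the one-argument form of scan, seeds the accumulator with the first list item. The names step, ewma and trace are made up for this illustration.

```python
from itertools import accumulate


def step(x, y, z):
    # x: smoothing constant (fixed, like the projection [x] in q)
    # y: previous result (the running EWMA)
    # z: next item taken from the input list
    return y * (1 - x) + z * x


def ewma(x, values, trace=None):
    def traced(y, z):
        # Record the arguments of every call, similar to 0N!(x;y;z).
        if trace is not None:
            trace.append((x, y, z))
        return step(x, y, z)

    # accumulate uses the first item as the initial y and feeds the rest as z.
    return list(accumulate(values, traced))


calls = []
result = ewma(0.25, [15.0, 20.0, 25.0, 30.0, 35.0], trace=calls)
print(result)  # [15.0, 16.25, 18.4375, 21.328125, 24.74609375]
for call in calls:
    print(call)  # first call: (0.25, 15.0, 20.0)
```

The trace has four entries for five inputs because the first input is never passed through the function; it only seeds y.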

Related

Sum in range until value change

I am trying to use this formula to make it work
=ARRAYFORMULA(IF(ISDATE_STRICT(S2:S) ; (MATCH(MAX(AB2:AB),AB2:AB;0)-1) ; "" ))
If there is a date in column "S", I want it to display the count of the blanks that would appear where column "S" contains text
=ARRAYFORMULA(IF(ISDATE_STRICT(S2:S) ; ArrayFormula(MATCH(FALSE ; ISBLANK(AB2:AB) ; 0)-1) ; "" ))
I've tried this one as well but I only get 0's as a result.
Any idea how I can make it work?
Here is the sample sheet.
https://docs.google.com/spreadsheets/d/19f5phXeAwXwrKbWz7njgbznmurOav72GUuo_5IGcbls/edit?usp=sharing
in Q2 use:
=ARRAYFORMULA(IF(ISBLANK(
I1:INDEX(I:I; ROWS(I:I)-1));
{N2:INDEX(N:N; ROWS(N:N))\
I1:INDEX(N:N; ROWS(N:N)-1)};
I1:INDEX(O:O; ROWS(O:O)-1)))
in X2 use:
=INDEX(LAMBDA(x; IFNA(VLOOKUP(x; QUERY(VLOOKUP(ROW(x);
IF(ISDATE_STRICT(x); {ROW(x)\x}); 2; 1);
"select Col1,count(Col1) group by Col1"); 2; 0)-1))
(Q2:INDEX(Q:Q; MAX((Q:Q<>"")*ROW(Q:Q)))))
UPDATE:
we start with column Q. we can take a range Q2:Q but that range contains a lot of empty rows. the next best thing is to check the last non-empty row and set it as the end of the range resulting in Q2:Q73. but static 73 won't do in case the dataset would grow or shrink so to get 73 dynamically we take the MAX of multiplication of Q:Q not being empty and row number of that case eg. Q:Q<>"" will output only TRUE or FALSE so what we are getting is
...
TRUE * 72 = 1 * 72 = 72
TRUE * 73 = 1 * 73 = 73
FALSE * 74 = 0 * 74 = 0
...
so the formula for getting Q2:Q73 is:
=Q2:INDEX(Q:Q; MAX((Q:Q<>"")*ROW(Q:Q)))
it could also be:
=INDEX(INDIRECT("Q2:Q"&MAX((Q:Q<>"")*ROW(Q:Q))))
but it's just long to type... next, we use the new LAMBDA function that allows us to reference cell/range/formula with a placeholder. simple LAMBDA syntax is:
=LAMBDA(x; x)(A1)
where x is A1 and we can do whatever we want with the 2nd (x) argument of LAMBDA like for example:
=LAMBDA(a, a+a*120-a/a)(A1)
you can think of it as:
LAMBDA(A1, A1+A1*120-A1/A1)(A1)
or as just:
=A1+A1*120-A1/A1
the issue here is that we repeat A1 4 times but with LAMBDA we do it only once. also, imagine if we would have 100 characters long formula instead of A1 so the final formula with lambda would be 300 characters shorter compared to "old way" formula.
back to our formula... x is the representation of Q2:Q73. now let's focus on VLOOKUP. basically, the idea here is that IF Q column contains a date we return that date, otherwise we return the last date from above. simply put:
=ARRAYFORMULA(VLOOKUP(ROW(Q2:Q73);
IF(ISDATE_STRICT(Q2:Q73); {ROW(Q2:Q73)\Q2:Q73}); 2; 1))
as you can see Y2, Y3 and Y4 are the same so all we need to do is to count them up and later take away one to exclude Q2 but include just Q3 and Q4 eg. 3-1=2. for that we use simple QUERY where the output is:
date count
30.06.2022 3
so all we need to do is to pair up dates from Q column to QUERY output for that we use the outer VLOOKUP where the output is as follows:
3
#N/A
#N/A
9
#N/A
#N/A
...
now is the right time for that -1 correction while we have these errors coz ERROR-1=ERROR and 3-1=2 so after this -1 correction the output is:
2
#N/A
#N/A
8
#N/A
#N/A
...
and all we need to do now is to hide errors with IFNA (IFERROR would work too) and the output is column X
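The logic of the walkthrough above can also be sketched outside Sheets. The Python below is a rough analogy under stated assumptions, not a translation of the formula: cells are modelled as a plain list, dates as datetime.date values, and the function names and sample data are invented for this illustration.

```python
from collections import Counter
from datetime import date


def last_non_empty_row(column, first_row=1):
    # Mirrors MAX((Q:Q<>"")*ROW(Q:Q)): True/False behave as 1/0 when multiplied.
    return max((cell != "") * row for row, cell in enumerate(column, start=first_row))


def rows_following_each_date(cells):
    # Step 1 (inner VLOOKUP): carry the most recent date down to every row.
    filled = []
    current = None
    for cell in cells:
        if isinstance(cell, date):
            current = cell
        filled.append(current)
    # Step 2 (QUERY ... count ... group by): count the rows per carried date.
    counts = Counter(d for d in filled if d is not None)
    # Step 3 (outer VLOOKUP, -1, IFNA): show count minus one on date rows only.
    return [counts[cell] - 1 if isinstance(cell, date) else "" for cell in cells]


sample = [date(2022, 6, 30), "a", "b", date(2022, 7, 1), "c", "", ""]
end = last_non_empty_row(sample)
print(end)  # 5
print(rows_following_each_date(sample[:end]))  # [2, '', '', 1, '']
```

As in the formula, a date that appears on more than one row would have its groups counted together, because the grouping key is the date itself.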

Efficient method to query percentile in a list

I've come across the requirement to collect the percentiles from a list a few times:
Within what percentile is a certain number?
What is the nth percentile in a list?
I have written these methods to solve the issue:
/for 1:
percentileWithinThreshold:{[threshold;list] (100 * count where list <= threshold) % count list};
/for 2:
thresholdForPercentile:{[percentile;list] (asc list)[-1 + "j"$((percentile % 100) * count list)]};
They work well for both use cases, but I was thinking this is a too common use case, so probably Q offers already something out of the box that does the same. Any idea if there already exists something else?
'100 xrank' generates percentiles.
q) 100 xrank 1 2 3 4
0 25 50 75
Solution for your second requirement:
q) f:{ y (100 xrank y:asc y) bin x}
Also, note that the result of your second function will not always be the same as the xrank-based one. The reason is that 'xrank' uses floor for a fractional index, which is the usual convention when calculating percentiles, whereas your function rounds the value and then subtracts 1, which ensures that the output never exceeds the input percentile. For example:
q) thresholdForPercentile[63;til 21] / output 12
q) f[63;til 21] / output 13
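To make the difference concrete, here is a small Python sketch of both approaches. It rests on assumptions: xrank_like models xrank as floor(buckets * rank / count), which matches the 0 25 50 75 output above, the "j"$ cast is modelled as round-to-nearest, and all function names are invented for this illustration.

```python
import math
from bisect import bisect_right


def xrank_like(buckets, values):
    # Assumed model of `buckets xrank values`: floor(buckets * rank / count).
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return [buckets * r // len(values) for r in ranks]


def f(percentile, values):
    # Analogue of {y (100 xrank y:asc y) bin x}.
    ordered = sorted(values)
    # bin-like lookup: index of the last bucket <= percentile.
    index = bisect_right(xrank_like(100, ordered), percentile) - 1
    return ordered[index] if index >= 0 else None


def threshold_for_percentile(percentile, values):
    # Analogue of the question's function: round to nearest, then subtract 1.
    ordered = sorted(values)
    index = math.floor(percentile / 100 * len(ordered) + 0.5) - 1
    return ordered[index] if index >= 0 else None


print(xrank_like(100, [1, 2, 3, 4]))  # [0, 25, 50, 75]
print(threshold_for_percentile(63, list(range(21))))  # 12
print(f(63, list(range(21))))  # 13
```

With 21 items, 63% of the count is 13.23. Rounding and subtracting 1 gives index 12, while the bucket lookup picks the last item whose bucket (61) does not exceed 63, which is index 13.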
For the first requirement, there is no built-in function. However, you could improve your function if you keep your input list sorted, because in that case you could use the 'bin' function, which runs faster on big lists.
q) percentileWithinThreshold:{[threshold;list] (100 * 1+list bin threshold) % count list};
Remember that 'bin' will throw a type error if one argument is of float type and the other is an integer, so make sure to cast them correctly inside the function.
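The idea behind the bin version, counting the items less than or equal to the threshold with a binary search instead of a full scan, can be sketched in Python with the standard bisect module. This is an illustration only; the function names are invented, and the input is assumed to be sorted already.

```python
from bisect import bisect_right


def percentile_within_threshold_scan(threshold, values):
    # Linear scan, like `count where list <= threshold`.
    return 100 * sum(1 for v in values if v <= threshold) / len(values)


def percentile_within_threshold_sorted(threshold, sorted_values):
    # Binary search on a sorted list: bisect_right returns the number of
    # items <= threshold, which plays the role of `1 + list bin threshold`.
    return 100 * bisect_right(sorted_values, threshold) / len(sorted_values)


data = list(range(21))
print(percentile_within_threshold_scan(12, data))
print(percentile_within_threshold_sorted(12, data))
```

Both calls return the same value; the second needs O(log n) comparisons once the list is sorted, which is where the speed-up on big lists comes from.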
qtln:{[x;y;z]cf:(0 1;1%2 2;0 0;1 1;1%3 3;3%8 8) z-4;n:count y:asc y;?[hf<1;first y;last y]^y[hf-1]+(h-hf)*y[hf]-y -1+hf:floor h:cf[0]+x*n+1f-sum cf}
qtl:qtln[;;8];

Select a number of random rows based on one column condition in MATLAB

I have a table 'X' like this:
name value score
joy 3 60
rony 8 50
macheis 20 20
joung 2 80
joy 8 3
joy 90 0
joung 4 78
machies 3 23
joy 7 99
I want to select 2 random rows(with name, value, score) where the name is 'joy'.
I applied something like this:
mnew = datasample(find(X.name=='joy'),2);
but it does not work and gives me the error: Undefined operator '==' for input arguments of type 'cell'.
The rows should be selected randomly (with all column values) where the name is joy.
Does anyone have another solution to this problem? How can I do it in MATLAB?
You have the right idea, but in order to check for the presence of a string within a cell array of strings, you need to use strcmp, ismember, or another method for comparing a string to a cell array.
You probably also want to specify that you don't want to use replacement when calling datasample so you don't get the same row twice.
subx = X(datasample(find(strcmp(X.name, 'joy')), 2, 'Replace', false),:);
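The same two steps, finding the matching row indices and then sampling them without replacement, can be sketched in Python for comparison. This is only an analogy to the MATLAB line above: the table is modelled as a list of tuples copied from the question, and sample_rows is a name invented for this illustration.

```python
import random

# (name, value, score) rows from the question's table.
rows = [
    ("joy", 3, 60), ("rony", 8, 50), ("macheis", 20, 20),
    ("joung", 2, 80), ("joy", 8, 3), ("joy", 90, 0),
    ("joung", 4, 78), ("machies", 3, 23), ("joy", 7, 99),
]


def sample_rows(table, name, k, rng=random):
    # Step 1: indices of the rows whose name matches (like find(strcmp(...))).
    candidates = [i for i, row in enumerate(table) if row[0] == name]
    # Step 2: draw k distinct indices (like 'Replace', false), keep full rows.
    return [table[i] for i in rng.sample(candidates, k)]


picked = sample_rows(rows, "joy", 2)
print(picked)
```

random.sample raises ValueError if fewer than k rows match, so asking for more rows than exist fails loudly instead of returning duplicates.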

Reshape (#) doesn't work with a dynamic argument

To form a matrix consisting of identical rows, one could use
x:1 2 3
2 3#x,x
which produces (1 2 3i;1 2 3i) as expected. However, attempting to generalise this thus:
2 (count x)#x,x
produces a type error although the types are equal:
(type 3) ~ type count x
returns 1b. Why doesn't this work?
The following should work.
q)(2;count x)#x,x
1 2 3
1 2 3
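If you look at the parse trees of both statements you can see that the second is evaluated differently. In the second, 2 and (count x) are not combined into a single list: (count x)#x,x is evaluated first, so only the result of count is passed as the left argument to #, and 2 is then applied to that result.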
q)parse"2 3#x,x"
#
2 3
(,;`x;`x)
q)parse"2 (count x)#x,x"
2
(#;(#:;`x);(,;`x;`x))
If you're looking to build matrices with identical rows you might be better off using
rownum#enlist x
q)x:100000?100
q)\ts do[100;v1:5 100000#x,x]
157 5767696j
q)\ts do[100;v2:5#enlist x]
0 992j
q)v1~v2
1b
I for one find this more natural (and it's faster!)
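As a loose analogy in Python, the two ways of building the matrix look like this. It assumes that reshape fills rows by taking items cyclically from the source list, which is consistent with v1~v2 returning 1b above; the reshape helper is invented for this illustration and is not q's implementation.

```python
from itertools import cycle, islice


def reshape(n_rows, n_cols, items):
    # Fill n_rows rows of n_cols items each, reusing items cyclically.
    source = cycle(items)
    return [list(islice(source, n_cols)) for _ in range(n_rows)]


x = [1, 2, 3]
v1 = reshape(2, len(x), x + x)    # like (2;count x)#x,x
v2 = [list(x) for _ in range(2)]  # like 2#enlist x
print(v1)        # [[1, 2, 3], [1, 2, 3]]
print(v1 == v2)  # True
```

The second form states the intent directly (repeat this row n times) and avoids building a concatenated list first.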

Using SUM and UNIQUE to count occurrences of value within subset of a matrix

So, presume a matrix like so:
20 2
20 2
30 2
30 1
40 1
40 1
I want to count the number of times 1 occurs for each unique value of column 1. I could do this the long way by [sum(x(1:2,2)==1)] for each value, but I think this would be the perfect use for the UNIQUE function. How could I fix it so that I could get an output like this:
20 0
30 1
40 2
Sorry if the solution seems obvious, my grasp of loops is very poor.
Indeed unique is a good option:
u=unique(x(:,1))
res=arrayfun(@(y)length(x(x(:,1)==y & x(:,2)==1)),u)
Taking apart that last line:
arrayfun(fun,array) applies fun to each element in the array, and puts it in a new array, which it returns.
Here the function is @(y)length(x(x(:,1)==y & x(:,2)==1)), which finds the length of the portion of x where the condition x(:,1)==y & x(:,2)==1 holds (this is called logical indexing). So for each of the unique elements, it counts the rows in x where the first column is that unique element and the second column is one.
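The same per-key counting can be written out in Python to show what the arrayfun line computes. This is an illustrative sketch only: the matrix is modelled as a list of (key, flag) pairs taken from the question, and count_matches_per_key is an invented name.

```python
matrix = [(20, 2), (20, 2), (30, 2), (30, 1), (40, 1), (40, 1)]


def count_matches_per_key(rows, target=1):
    # Unique values of column 1, sorted (like unique(x(:,1))).
    keys = sorted({key for key, _ in rows})
    # For each key, count rows with that key whose second column equals target.
    return [(key, sum(1 for k, flag in rows if k == key and flag == target))
            for key in keys]


print(count_matches_per_key(matrix))  # [(20, 0), (30, 1), (40, 2)]
```

Keys with no matching rows still appear with a count of 0, which matches the output the question asks for.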
Try this (as specified in this answer):
>> [c,~,d] = unique(a(a(:,2)==1))
c =
30
40
d =
1
2
2
>> counts = accumarray(d(:),1,[],@sum)
counts =
1
2
>> res = [c,counts]
Suppose you have an array of various integers in 'array'.
The tabulate function will sort the unique values and count the occurrences.
table = tabulate(array)
Look for the counts of your unique values in column 2 of table.