Efficient method to query percentile in a list - kdb

I've come across the requirement to collect the percentiles from a list a few times:
1. Within what percentile is a certain number?
2. What is the nth percentile in a list?
I have written these methods to solve the issue:
/for 1:
percentileWithinThreshold:{[threshold;list] (100 * count where list <= threshold) % count list};
/for 2:
thresholdForPercentile:{[percentile;list] (asc list)[-1 + "j"$((percentile % 100) * count list)]};
They work well for both use cases, but this seems like such a common use case that q probably already offers something out of the box that does the same. Does something like that already exist?

'100 xrank' generates percentiles.
q) 100 xrank 1 2 3 4
0 25 50 75
Solution for your second requirement:
q) f:{ y (100 xrank y:asc y) bin x}
Also note that your second function will not always give the same result as the xrank version. xrank takes the floor of the fractional index, which is the usual convention when calculating percentiles, whereas your function rounds the index to the nearest integer and then subtracts 1, which ensures that the output always sits at or below the input percentile. For example:
q) thresholdForPercentile[63;til 21] / output 12
q) f[63;til 21] / output 13
For the first requirement there is no built-in function. However, you could speed up your function by keeping the input list sorted, because in that case you can use 'bin', which does a binary search and runs faster on big lists.
q) percentileWithinThreshold:{[threshold;list] (100 * 1+list bin threshold) % count list};
Remember that 'bin' will throw a type error if one argument is a float and the other is an integer, so make sure to cast them consistently inside the function.
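For readers who don't have q at hand, here is a rough Python sketch of the two ideas above. It is only an illustration: the function names are made up, bisect stands in for 'bin', and the bucket formula floor(100*i/n) is an assumption about xrank that matches the 0 25 50 75 example.

```python
from bisect import bisect_right

def percentile_within_threshold(threshold, sorted_items):
    # bisect_right counts the items <= threshold, like 1 + (list bin threshold) in q.
    return 100 * bisect_right(sorted_items, threshold) / len(sorted_items)

def threshold_for_percentile(percentile, items):
    ordered = sorted(items)
    n = len(ordered)
    # Assumed xrank behaviour: the i-th smallest item gets bucket floor(100*i/n).
    buckets = [100 * i // n for i in range(n)]
    # Last position whose bucket is <= percentile, mirroring 'bin'.
    return ordered[bisect_right(buckets, percentile) - 1]

print(percentile_within_threshold(2, [1, 2, 3, 4]))  # 50.0
print(threshold_for_percentile(63, range(21)))       # 13
```

The second call reproduces the f[63;til 21] result of 13 quoted above.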

qtln:{[x;y;z]cf:(0 1;1%2 2;0 0;1 1;1%3 3;3%8 8) z-4;n:count y:asc y;?[hf<1;first y;last y]^y[hf-1]+(h-hf)*y[hf]-y -1+hf:floor h:cf[0]+x*n+1f-sum cf}
qtl:qtln[;;8];

SML Uncaught exception Empty homework1

Question: Write a function number_before_reaching_sum that takes an int called sum, which you can assume
is positive, and an int list, which you can assume contains all positive numbers, and returns an int.
You should return an int n such that the first n elements of the list add to less than sum, but the first
n + 1 elements of the list add to sum or more. Assume the entire list sums to more than the passed in
value; it is okay for an exception to occur if this is not the case.
I am quite new to SML and couldn't find anything wrong with this simple expression, but it raises an uncaught exception Empty. Please help me debug the code below.
fun number_before_reaching_sum (sum:int, xl: int list) =
if hd xl = sum
then 0
else
(hd xl) + number_before_reaching_sum(sum, (tl xl))
Try a couple of steps of your solution on a short list:
number_before_reaching_sum (6, [2,3,4])
--> if 2 = 6
then 0
else 2 + number_before_reaching_sum(6, [3,4])
--> 2 + if 3 = 6
then 0
else 3 + number_before_reaching_sum(6, [4])
--> ...
and you see pretty clearly that this is wrong - the elements of the list should not be added up, and you can't keep looking for the same sum in every tail.
You should return an int n such that the first n elements of the list add to less than sum, but the first n + 1 elements of the list add to sum or more.
This means that the result is 0 if the head is greater than or equal to the sum,
if hd xl >= sum
then 0
Otherwise, the index is one more, not hd xl more, than the index in the tail.
Also the "tail sum" you're looking for isn't the original sum, but the sum without hd xl.
else 1 + number_before_reaching_sum(sum - hd xl, tl xl)
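Putting the two corrected branches together, the same recursion can be sketched in Python (not SML, just to make the logic easy to run; the parameter names are mine):

```python
def number_before_reaching_sum(total, xs):
    # If the head alone already reaches the remaining total, no elements fit before it.
    if xs[0] >= total:
        return 0
    # Otherwise count this element and look for the reduced total in the tail.
    return 1 + number_before_reaching_sum(total - xs[0], xs[1:])

print(number_before_reaching_sum(6, [2, 3, 4]))  # 2
```

As in the SML version, running off the end of the list raises an exception (IndexError here, Empty in SML), which the assignment explicitly allows.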

KDB - Automatic function argument behavior with Iterators

I'm struggling to understand the behavior of the arguments in the scan function below. I understand the EWMA calc and have made an Excel worksheet to match it, but the kdb syntax is throwing me off in terms of what (and when) x, y and z are. I've referenced Q for Mortals, books and https://code.kx.com/q/ref/over/ and I do understand what's going on in the simpler examples provided.
I understand the EWMA formula based on the Excel calc but how is that translated into the function below?
x = constant, y= passed in values (but also appears to be prior result?) and z= (prev period?)
ewma: {{(y*1-x)+(z*x)} [x]\[y]};
ewma [.25; 15 20 25 30 35f]
15 16.25 18.4375 21.32813 24.74609
Rearranging terms makes it easier to read, but if I were to write this in Excel, I would incorrectly reference the y value column in the addition instead of correctly referencing the previous EWMA value.
ewma: {{y+x*z-y} [x]\[y]};
ewma [.25; 15 20 25 30 35f]
15 16.25 18.4375 21.32813 24.74609
[Image: EWMA in Excel formula for auditing]
0N! is useful in these cases for determining which variables are passed. Simply add it to the start of the function to display a variable in the console. E.g. to show what z is being passed in on each run:
q)ewma: {{0N!z;(y*1-x)+(z*x)} [x]\[y]};
q)ewma [.25; 15 20 25 30 35f]
20f
25f
30f
35f
//Or multiple at once
q)ewma: {{0N!(x;y;z);(y*1-x)+(z*x)} [x]\[y]};
q)
q)ewma [.25; 15 20 25 30 35f]
0.25 15 20
0.25 16.25 25
0.25 18.4375 30
0.25 21.32812 35
Edit:
To see why z is holding the outer 'y' values, it is best to think about the simplified example below, using just x/y.
//two parameters specified in beginning.
//x initialised as 1 then takes the function result for next run
//y takes value of next value in list
q){0N!(x;y);x+y}\[1;2 3 4]
1 2
3 3
6 4
3 6 10
//in this example only one parameter is passed
//but q takes first value in list as x in this special case
q){0N!(x;y);x+y}\[1 2 3 4]
1 2
3 3
6 4
1 3 6 10
A similar thing is happening in your example. x is not being passed to the iterator and therefore keeps the same value on each run.
The inner function's y is initialised with the first value of the outer y variable (15f in this case), as in the simplified example above, and z takes the 2nd value of the list for its initial run. After that, y takes the result of the previous function run and z takes the next value in the list, until the whole list has been passed to the function.
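If it helps to see the same argument flow outside q, here is a small Python sketch using itertools.accumulate, which behaves like scan with no explicit seed: the first list item becomes the initial "previous result". The names alpha, prev and new are mine and correspond to x, y and z of the inner function.

```python
from itertools import accumulate

def ewma(alpha, values):
    # prev is the running result (inner y), new is the next list item (inner z).
    return list(accumulate(values, lambda prev, new: prev * (1 - alpha) + new * alpha))

print(ewma(0.25, [15, 20, 25, 30, 35]))
# [15, 16.25, 18.4375, 21.328125, 24.74609375]
```

The output agrees with the q result 15 16.25 18.4375 21.32813 24.74609 up to display rounding.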

Save outputs of nested for loops in MATLAB

I have the following code, which should produce an output matrix Rpp of size (10201,3). When I run it (which takes a while) and check the size of Rpp, I see (1,3). I have tried many things but couldn't find a proper fix. The logic is to take 6 values (4 constants and 2 variables, each chosen from 101 values), do the calculation for 3 different i1, and store every output vector of 3 in a matrix with 101*101 rows (one per pair of the 2 variable values) and 3 columns (one per i1).
I appreciate your help
Vp1=linspace(3000,3500,101);
Vp2=3850;
rho1=2390;
rho2=2510;
Vs1=linspace(1250,1750,101);
Vs2=2000;
i1=[10 25 40];
Rpp = zeros(length(Vp1)*length(Vs1),length (i1));
for n=1:length(Vp1)*length(Vs1)
for m=1:length (i1)
for l=1:length(Vp1)
for k=1:length(Vs1)
p=sin(i1)/Vp1(l);
i2=asin(p*Vp2);
j1=asin(p*Vs1(k));
j2=asin(p*Vs2);
a=rho2*(1-2*Vs2^2*p.^2)-rho1*(1-2*Vs1(k).^2*p.^2);
b=rho2*(1-2*Vs2^2*p.^2)+2*rho1*Vs1(k)^2*p.^2;
c=rho1*(1-2*Vs1(k)^2*p.^2)+2*rho2*Vs2^2*p.^2;
d=2*(rho2*Vs2^2-rho1*Vs1(k)^2);
E=b.*cos(i1)./Vp1(l)+c.*cos(i2)/Vp2;
F=b.*cos(j1)./Vs1(k)+c.*cos(j2)/Vs2;
G=a-d*(cos(i1)/Vp1(l)).*(cos(j2)/Vs2);
H=a-d*(cos(i2)/Vp2).*(cos(j1)/Vs1(k));
D=E.*F+G.*H.*p.^2;
Rpp=((b.*(cos(i1)/Vp1(l))-c.*cos((i2)/Vp2)).*F-(a+d*((cos(i1)/Vp1(l))).*(cos(j2)/Vs2)).*H.*p.^2)./D
end
end
end
end
Try this. Your 2 outer loops didn't do anything: you never used m or n, so I removed those 2 loops. You also kept overwriting Rpp on every iteration, so your initialization of Rpp had no effect. I added an index variable to assign each result to what I think is the correct row of Rpp.
Vp1=linspace(3000,3500,101);
Vp2=3850;
rho1=2390;
rho2=2510;
Vs1=linspace(1250,1750,101);
Vs2=2000;
i1=[10 25 40];
Rpp = zeros(length(Vp1)*length(Vs1),length (i1));
index = 1;
for l=1:length(Vp1)
for k=1:length(Vs1)
p=sin(i1)/Vp1(l);
i2=asin(p*Vp2);
j1=asin(p*Vs1(k));
j2=asin(p*Vs2);
a=rho2*(1-2*Vs2^2*p.^2)-rho1*(1-2*Vs1(k).^2*p.^2);
b=rho2*(1-2*Vs2^2*p.^2)+2*rho1*Vs1(k)^2*p.^2;
c=rho1*(1-2*Vs1(k)^2*p.^2)+2*rho2*Vs2^2*p.^2;
d=2*(rho2*Vs2^2-rho1*Vs1(k)^2);
E=b.*cos(i1)./Vp1(l)+c.*cos(i2)/Vp2;
F=b.*cos(j1)./Vs1(k)+c.*cos(j2)/Vs2;
G=a-d*(cos(i1)/Vp1(l)).*(cos(j2)/Vs2);
H=a-d*(cos(i2)/Vp2).*(cos(j1)/Vs1(k));
D=E.*F+G.*H.*p.^2;
Rpp(index,:)=((b.*(cos(i1)/Vp1(l))-c.*cos((i2)/Vp2)).*F-(a+d*((cos(i1)/Vp1(l))).*(cos(j2)/Vs2)).*H.*p.^2)./D;
index = index+1;
end
end
Results:
>> size(Rpp)
ans =
10201 3
The way you use the for loops is wrong. You're running the calculation length(Vp1)*length(Vs1) * length(i1) * length(Vp1) * length(Vs1) times. Here's the correct way. I changed l to lll so I won't confuse it with the number 1. Each iteration of the first for loop runs the inner loop length(Vs1) times, and you need to assign each result (a 1x3 array) to the row of Rpp given by k+(lll-1)*length(Vp1). (This works because Vp1 and Vs1 have the same length here; in general the row stride is the inner loop length, length(Vs1).)
for lll=1:length(Vp1)
for k=1:length(Vs1)
p=sin(i1)/Vp1(lll);
i2=asin(p*Vp2);
j1=asin(p*Vs1(k));
j2=asin(p*Vs2);
a=rho2*(1-2*Vs2^2*p.^2)-rho1*(1-2*Vs1(k).^2*p.^2);
b=rho2*(1-2*Vs2^2*p.^2)+2*rho1*Vs1(k)^2*p.^2;
c=rho1*(1-2*Vs1(k)^2*p.^2)+2*rho2*Vs2^2*p.^2;
d=2*(rho2*Vs2^2-rho1*Vs1(k)^2);
E=b.*cos(i1)./Vp1(lll)+c.*cos(i2)/Vp2;
F=b.*cos(j1)./Vs1(k)+c.*cos(j2)/Vs2;
G=a-d*(cos(i1)/Vp1(lll)).*(cos(j2)/Vs2);
H=a-d*(cos(i2)/Vp2).*(cos(j1)/Vs1(k));
D=E.*F+G.*H.*p.^2;
Rpp(k+(lll-1)*length(Vp1),:)=((b.*(cos(i1)/Vp1(lll))-c.*cos((i2)/Vp2)).*F-(a+d*((cos(i1)/Vp1(lll))).*(cos(j2)/Vs2)).*H.*p.^2)./D;
end
end
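The row bookkeeping is the only part that really matters in both answers, so here is a stripped-down Python sketch of it. The helper fill_rows and its row_fn argument are hypothetical stand-ins for the Rpp formula; indices are 0-based, so the MATLAB row k+(lll-1)*length(Vs1) becomes l*len(inner)+k.

```python
def fill_rows(outer, inner, row_fn):
    # Preallocate one slot per (outer, inner) pair, like zeros(...) in MATLAB.
    rows = [None] * (len(outer) * len(inner))
    for l, a in enumerate(outer):
        for k, b in enumerate(inner):
            # Row stride is the length of the inner loop.
            rows[l * len(inner) + k] = row_fn(a, b)
    return rows

rows = fill_rows([1, 2], [10, 20, 30], lambda a, b: (a, b))
print(len(rows))  # 6
print(rows[4])    # (2, 20)
```

A running counter incremented at the end of the inner loop visits the rows in exactly the same order.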

How to do sum of list subset in kdb?

Suppose you have a list, and a second list holding a limited number of indices into the first list, in ascending order.
How can you get the sum of the elements of the first list between consecutive indices in the second list?
e.g:
list1: til 100;
idx: (1 20 50 70 100);
How can we get a list with the sums of the elements of list1 over the ranges 1:20, 20:50, 50:70, 70:100?
The obvious approach would be to use # and _ on elements of idx, but can we do that iteratively somehow, without using first, first 1_idx, etc.?
Something like this would work:
q)sum each idx cut list1
190 1035 1190 2535 0
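cut splits its second argument at the indices given in its first. That is why you see the 0 at the end of the result: the last cut is at index 100, which is past the end of the 100-element list, so the final piece is empty. Note also that anything before the first index (element 0 here) is dropped.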
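To make the slicing explicit, here is a small Python imitation of what cut does with a list of indices. This is only a sketch of the behaviour described above, not how q implements it.

```python
def cut(indices, items):
    # Each piece runs from one index up to (not including) the next;
    # the last piece runs to the end of the list.
    bounds = list(indices) + [len(items)]
    return [items[start:end] for start, end in zip(bounds, bounds[1:])]

list1 = list(range(100))
idx = [1, 20, 50, 70, 100]
print([sum(piece) for piece in cut(idx, list1)])
# [190, 1035, 1190, 2535, 0]
```

Dropping the trailing 100 from idx gets rid of the empty last piece and its 0.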

Calculating prime numbers in Scala: how does this code work?

So I've spent hours trying to work out exactly how this code produces prime numbers.
lazy val ps: Stream[Int] = 2 #:: Stream.from(3).filter(i =>
ps.takeWhile{j => j * j <= i}.forall{ k => i % k > 0});
I've used a number of printlns etc., but nothing's making it clearer.
This is what I think the code does:
/**
* [2,3]
*
* takeWhile 2*2 <= 3
* takeWhile 2*2 <= 4 found match
* (4 % [2,3] > 1) return false.
* takeWhile 2*2 <= 5 found match
* (5 % [2,3] > 1) return true
* Add 5 to the list
* takeWhile 2*2 <= 6 found match
* (6 % [2,3,5] > 1) return false
* takeWhile 2*2 <= 7
* (7 % [2,3,5] > 1) return true
* Add 7 to the list
*/
But if I change j*j in the list to be 2*2, which I assumed would work exactly the same, it causes a stack overflow error.
I'm obviously missing something fundamental here, and could really use someone explaining this to me like I was a five year old.
Any help would be greatly appreciated.
I'm not sure that seeking a procedural/imperative explanation is the best way to gain understanding here. Streams come from functional programming and they're best understood from that perspective. The key aspects of the definition you've given are:
It's lazy. Other than the first element in the stream, nothing is computed until you ask for it. If you never ask for the 5th prime, it will never be computed.
It's recursive. The list of prime numbers is defined in terms of itself.
It's infinite. Streams have the interesting property (because they're lazy) that they can represent a sequence with an infinite number of elements. Stream.from(3) is an example of this: it represents the list [3, 4, 5, ...].
Let's see if we can understand why your definition computes the sequence of prime numbers.
The definition starts out with 2 #:: .... This just says that the first number in the sequence is 2 - simple enough so far.
The next part defines the rest of the prime numbers. We can start with all the counting numbers starting at 3 (Stream.from(3)), but we obviously need to filter a bunch of these numbers out (i.e., all the composites). So let's consider each number i. If i is not a multiple of a lesser prime number, then i is prime. That is, i is prime if, for all primes k less than i, i % k > 0. In Scala, we could express this as
nums.filter(i => ps.takeWhile(k => k < i).forall(k => i % k > 0))
However, it isn't actually necessary to check all lesser prime numbers -- we really only need to check the prime numbers whose square is less than or equal to i (this is a fact from number theory*). So we could instead write
nums.filter(i => ps.takeWhile(k => k * k <= i).forall(k => i % k > 0))
So we've derived your definition.
Now, if you happened to try the first definition (with k < i), you would have found that it didn't work. Why not? It has to do with the fact that this is a recursive definition.
Suppose we're trying to decide what comes after 2 in the sequence. The definition tells us to first determine whether 3 belongs. To do so, we consider the list of primes up to the first one greater than or equal to 3 (takeWhile(k => k < i)). The first prime is 2, which is less than 3 -- so far so good. But we don't yet know the second prime, so we need to compute it. Fine, so we need to first see whether 3 belongs ... BOOM!
* It's pretty easy to see that if a number n is composite then the square of one of its factors must be less than or equal to n. If n is composite, then by definition n == a * b, where 1 < a <= b < n (we can guarantee a <= b just by labeling the two factors appropriately). From a <= b it follows that a^2 <= a * b, so it follows that a^2 <= n.
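As a rough cross-check of the derivation, here is the same idea as a Python generator. It is not a translation of the Stream semantics: the list found plays the role of the already-computed part of ps, and all names are mine.

```python
from itertools import count, islice, takewhile

def primes():
    found = [2]  # the part of the sequence computed so far
    yield 2
    for i in count(3):
        # Only test primes p with p*p <= i, as in the takeWhile/forall version.
        if all(i % p > 0 for p in takewhile(lambda p: p * p <= i, found)):
            found.append(i)
            yield i

print(list(islice(primes(), 10)))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Unlike the Stream version, a plain list simply ends when the loop runs past the known primes, so the k < i variant would not blow up here; the recursion problem is specific to the self-referential lazy definition.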
Your explanations are mostly correct, you made only two mistakes:
takeWhile doesn't include the last checked element:
scala> List(1,2,3).takeWhile(_<2)
res1: List[Int] = List(1)
You assume that ps always contains only a two and a three but because Stream is lazy it is possible to add new elements to it. In fact each time a new prime is found it is added to ps and in the next step takeWhile will consider this new added element. Here, it is important to remember that the tail of a Stream is computed only when it is needed, thus takeWhile can't see it before forall is evaluated to true.
Keep these two things in mind and you should come up with this:
ps = [2]
i = 3
takeWhile
2*2 <= 3 -> false
forall on []
-> true
ps = [2,3]
i = 4
takeWhile
2*2 <= 4 -> true
3*3 <= 4 -> false
forall on [2]
4%2 > 0 -> false
ps = [2,3]
i = 5
takeWhile
2*2 <= 5 -> true
3*3 <= 5 -> false
forall on [2]
5%2 > 0 -> true
ps = [2,3,5]
i = 6
...
While these steps describe the behavior of the code, they are not fully accurate, because not only adding elements to the Stream is lazy but every operation on it. This means that when you call xs.takeWhile(f), the values up to the point where f becomes false are not all computed at once; they are computed when forall wants to see them (forall is the only function here that needs to look at all elements before it can return true; for false it can abort earlier). Here is the computation order when laziness is considered everywhere (example only looking at 9):
ps = [2,3,5,7]
i = 9
takeWhile on 2
2*2 <= 9 -> true
forall on 2
9%2 > 0 -> true
takeWhile on 3
3*3 <= 9 -> true
forall on 3
9%3 > 0 -> false
ps = [2,3,5,7]
i = 10
...
Because forall is aborted when it evaluates to false, takeWhile doesn't calculate the remaining possible elements.
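The interleaving can be reproduced with a Python generator, which is lazy in the same pull-based way. This is only a sketch: trace_check and its steps log are made-up names, and known_primes stands for the part of ps that already exists.

```python
def trace_check(i, known_primes):
    steps = []

    def candidates():
        # Plays the role of takeWhile: yields primes one at a time, on demand.
        for p in known_primes:
            steps.append(f"takeWhile on {p}")
            if p * p > i:
                return
            yield p

    is_prime = True
    for p in candidates():
        # Plays the role of forall: stops at the first divisor found.
        steps.append(f"forall on {p}")
        if i % p == 0:
            is_prime = False
            break
    return is_prime, steps

print(trace_check(9, [2, 3, 5, 7]))
# (False, ['takeWhile on 2', 'forall on 2', 'takeWhile on 3', 'forall on 3'])
```

5 and 7 never show up in the log, because nothing asks for them once 9 % 3 gives 0.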
That code is easier (for me, at least) to read with some variables renamed suggestively, as
lazy val ps: Stream[Int] = 2 #:: Stream.from(3).filter(i =>
ps.takeWhile{p => p * p <= i}.forall{ p => i % p > 0});
This reads left-to-right quite naturally, as
primes are 2, and those numbers i from 3 up such that none of the primes p whose square does not exceed i divides i evenly (i.e. every such p leaves a non-zero remainder).
In a true recursive fashion, to understand this definition as defining the ever increasing stream of primes, we assume that it is so, and from that assumption we see that no contradiction arises, i.e. the truth of the definition holds.
The only potential problem after that, is the timing of accessing the stream ps as it is being defined. As the first step, imagine we just have another stream of primes provided to us from somewhere, magically. Then, after seeing the truth of the definition, check that the timing of the access is okay, i.e. we never try to access the areas of ps before they are defined; that would make the definition stuck, unproductive.
I remember reading somewhere (don't recall where) something like the following -- a conversation between a student and a wizard,
student: which numbers are prime?
wizard: well, do you know what number is the first prime?
s: yes, it's 2.
w: okay (quickly writes down 2 on a piece of paper). And what about the next one?
s: well, next candidate is 3. we need to check whether it is divided by any prime whose square does not exceed it, but I don't yet know what the primes are!
w: don't worry, I'll give them to you. It's a magic I know; I'm a wizard after all.
s: okay, so what is the first prime number?
w: (glances over the piece of paper) 2.
s: great, so its square is already greater than 3... HEY, you've cheated! .....
Here's a pseudocode¹ translation of your code, read partially right-to-left, with some variables again renamed for clarity (using p for "prime"):
ps = 2 : filter (\i-> all (\p->rem i p > 0) (takeWhile (\p->p^2 <= i) ps)) [3..]
which is also
ps = 2 : [i | i <- [3..], and [rem i p > 0 | p <- takeWhile (\p->p^2 <= i) ps]]
which is a bit more visually apparent, using list comprehensions. and checks that all entries in a list of Booleans are True (read | as "for", <- as "drawn from", , as "such that" and (\p-> ...) as "lambda of p").
So you see, ps is a lazy list of 2, and then of numbers i drawn from a stream [3,4,5,...] such that for all p drawn from ps such that p^2 <= i, it is true that i % p > 0. Which is actually an optimal trial division algorithm. :)
There's a subtlety here of course: the list ps is open-ended. We use it as it is being "fleshed-out" (that of course, because it is lazy). When p's are taken from ps, we could potentially run past its end, in which case we'd have a non-terminating calculation on our hands (a "black hole"). It just so happens :) (and this needs to be, and can be, proved mathematically) that this is impossible with the above definition. And 2 is put into ps unconditionally, so there's something in it to begin with.
But if we try to "simplify",
bad = 2 : [i | i <- [3..], and [rem i p > 0 | p <- takeWhile (\p->p < i) bad]]
it stops working after producing just one number, 2: when considering 3 as the candidate, takeWhile (\p->p < 3) bad demands the next number in bad after 2, but there aren't yet any more numbers there. It "jumps ahead of itself".
This is "fixed" with
bad = 2 : [i | i <- [3..], and [rem i p > 0 | p <- [2..(i-1)] ]]
but that is a much much slower trial division algorithm, very far from the optimal one.
--
¹ (Haskell actually, it's just easier for me that way :) )