How to precalculate expensive Expressions in Polars (in groupby-s and in general)? - python-polars

I'm having a hard time dealing with the fact that in a groupby I can't efficiently capture a group's sub-dataframe with an Expr, perform an expensive operation on it once, and then return several different aggregations. I can sort of do it (see example), but my solution is unreadable and seems to carry unnecessary overhead because of all those lists. Is there a proper, or a completely different, way to do it?
Take a look at this example:
import polars as pl
import numpy as np
df = pl.DataFrame(np.random.randint(0,10,size=(1000000, 3)))
expensive = pl.col('column_1').cumprod().ewm_std(span=10).alias('expensive')
%%timeit
(
    df
    .groupby('column_0')
    .agg([
        expensive.sum().alias('sum'),
        expensive.median().alias('median'),
        *[expensive.max().alias(f'max{x}') for x in range(10)],
    ])
)
417 ms ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(
    df
    .groupby('column_0')
    .agg(expensive)
    .select([
        pl.col('expensive').arr.eval(pl.element().sum()).arr.first().alias('sum'),
        pl.col('expensive').arr.eval(pl.element().median()).arr.first().alias('median'),
        *[pl.col('expensive').arr.eval(pl.element().max()).arr.first().alias(f'max{x}') for x in range(10)]
    ])
)
95.5 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that precomputing the expensive part is beneficial, but actually doing it involves this .arr.eval(pl.element().<aggfunc>()).arr.first() pattern that really bothers me in terms of both readability and flexibility. Try as I might, I can't see a better solution.
I'm not sure whether the problem is specific to groupbys; if your solution involves dealing with selects, please share that as well.

Use explode instead of arr.eval like this:
%%timeit
df \
    .groupby('column_0') \
    .agg(expensive) \
    .explode('expensive') \
    .groupby('column_0') \
    .agg([
        pl.col('expensive').sum().alias('sum'),
        pl.col('expensive').median().alias('median'),
        *[pl.col('expensive').max().alias(f'max{x}') for x in range(10)]
    ])
On my machine the run times were
Your first example: 320 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your second: 80.8 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mine: 63 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another method, which turns out to be slightly slower than the above, is to evaluate the expensive expression as a window function, which skips the explode:
%%timeit
df.select(['column_0', expensive.over('column_0')]).groupby('column_0').agg([
    pl.col('expensive').sum().alias('sum'),
    pl.col('expensive').median().alias('median'),
    *[pl.col('expensive').max().alias(f'max{x}') for x in range(10)]
])
This last one returned in 69.7 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
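If you prefer to keep everything in a single (lazy) query, the window-function idea combines with the aggregation like this; a minimal sketch using the same polars API/version as the question (newer polars releases rename some of these methods, e.g. groupby to group_by):
import numpy as np
import polars as pl

df = pl.DataFrame(np.random.randint(0, 10, size=(1_000_000, 3)))
expensive = pl.col('column_1').cumprod().ewm_std(span=10).alias('expensive')

out = (
    df.lazy()
    # evaluate the expensive expression once per group ...
    .with_columns([expensive.over('column_0')])
    .groupby('column_0')
    # ... then aggregate the cached column as many times as needed
    .agg([
        pl.col('expensive').sum().alias('sum'),
        pl.col('expensive').median().alias('median'),
        *[pl.col('expensive').max().alias(f'max{x}') for x in range(10)],
    ])
    .collect()
)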

Related

Is the complexity of kdb's moving max function mmax O(n)?

I used the function mmax to calculate the moving max of a 10-million-length integer vector. I ran it 10 times to calculate the total execution time. The running time for window size 132 (15,025 milliseconds) is 6 times longer than for window size 22 (2,425 milliseconds). It seems the complexity of mmax is O(nw) rather than O(n), where w is the length of the sliding window.
To check if this is true for other similar products, I tried the same experiment on DolphinDB, a time series database with built-in analytics features (https://www.dolphindb.com/downloads.html ). In contrast, DolphinDB’s mmax has linear complexity O(n), regardless of the window size: 1,277 milliseconds (window size 132) and 1,233 milliseconds (window size 22).
The hardware being used for this experiment:
Server: Dell PowerEdge R630
Architecture: x86_64
CPU Model Name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Total logical CPU cores: 48
Total memory: 256G
Experiment setup
I used the KDB+ 4.0 64-bit version and DolphinDB_Linux_V2.00.7 (DolphinDB community version: 2 cores and 8GB memory). Both experiments were conducted using 2 CPU cores.
KDB implementation
// Start the server
rlwrap -r taskset -c 0,1 ./l64/q -p 5002 -s 2
// The code
a:10000000?10000i
\t do[10; 22 mmax a]
2425
\t do[10; 132 mmax a]
15025
DolphinDB implementation
// Start the server
rlwrap -r ./dolphindb -localSite localhost:5002:local5002 -localExecutors 1
// The code
a=rand(10000,10000000)
timer(10) mmax(a,22);
1232.83 ms
timer(10) mmax(a,132);
1276.53 ms
Can any kdb expert confirm the complexity of the function mmax? If the built-in mmax really does have complexity O(nw), is there any third-party plugin for kdb that improves the performance?
Yes, it would scale with the size of the window as well as the size of the list. If you look at the definition of mmax:
q)mmax
k){(x-1)|':/y}
it is "equivalent" to
q)a:8 1 9 5 4 6 6 1 8 5
q)w:3
q)mmax[w;a]~{{max x,y}':[x]}/[w-1;a]
1b
which can more clearly be understood as the last output of:
q){{max x,y}':[x]}\[w-1;a]
8 1 9 5 4 6 6 1 8 5
8 8 9 9 5 6 6 6 8 8
8 8 9 9 9 6 6 6 8 8
- take the max of each item with its previous item: {max x,y}':[x]
- then do that same operation again on the output: {}\
- do the same operation on the output (w-1) times in total: \[w-1;a]
From that it's clear that the window size impacts the number of times the loop is performed. As for a faster implementation, there might be a different but less "elegant" algorithm which does it quicker and could be written in k/q. Otherwise you could import an implementation written in C - see https://code.kx.com/q/ref/dynamic-load/
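For reference, the standard O(n) sliding-window maximum uses a monotonically decreasing deque; here is a minimal sketch in Python, illustrative only (to actually speed up mmax you would need this written in q/k or in C and loaded via the dynamic-load mechanism linked above):
from collections import deque

def moving_max(a, w):
    # Mirrors `w mmax a`: partial windows at the start of the vector.
    out = []
    dq = deque()                       # indices whose values are decreasing
    for i, x in enumerate(a):
        while dq and a[dq[-1]] <= x:
            dq.pop()                   # drop values dominated by the new one
        dq.append(i)
        if dq[0] <= i - w:
            dq.popleft()               # drop indices that fell out of the window
        out.append(a[dq[0]])           # front of the deque is the window max
    return out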

ipython / Question about the %timeit looping process

Here is my code in Google Colab :
myArray=[]
Then
%%timeit -n 2
myArray.append("1")
The result shows myArray ending up with ten elements, which I don't really understand (I was expecting only two values in myArray).
Timeit has two arguments you can pass to tell it how many times the code should be run: number (-n) and repeat (-r).
repeat tells timeit how many samples it should take
number specifies the number of times to repeat the code for each sample
Now, the default repeat value is 5. So, when number is 2 and repeat is 5, 2*5=10, which is the number of times the code is actually run and also the number of elements that get appended to the list.
To fix this you should also specify the repeat argument with -r.
Edit
For every sample you take (-r), you also run the setup code you may have passed to timeit. On the other hand, the number (-n) tells timeit how many times it should run your code for every sample. Your code is executed n times only after the setup, which is done r times, once for every sample.
From the timeit documentation:
timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)
This is the timeit function signature. As you can see, there is a setup parameter you can pass to specify any code that should be run prior to executing your code (which will be executed number times, equivalent to -n) in the sample.
Let's now compare it with the timeit.repeat function:
timeit.repeat(stmt='pass', setup='pass', timer=<default timer>, repeat=5, number=1000000, globals=None)
As you can see, there is an extra parameter here: repeat. Note that the default value of repeat (equivalent to -r) is 5, and this is why you get 10 elements appended to your list in your example.
Why should you use both -r and -n?
It's better to specify both the number of runs per sample and the number of samples to take, for comparability reasons. Think of every sample as executing your Python script: Python has to load the script, perform some initial setup, and only then does it run your code.
You can also think of the number (-n) as the number of iterations in a for loop: the setup has already been done prior to running your code.
Here's a simplified Python-like pseudocode representation of what's happening when you use the timeit module:
def timeit(your_code, repeat, number, setup):
    for r in range(repeat):
        perform_setup()
        for n in range(number):
            run(your_code)
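A concrete check with the timeit module outside IPython (a minimal sketch; the variable name is made up):
import timeit

my_list = []
# number=2 and repeat=5 match `-n 2` with the default `-r 5`:
# the statement runs 2 * 5 = 10 times in total.
timeit.repeat("my_list.append('1')", repeat=5, number=2,
              globals={'my_list': my_list})
print(len(my_list))  # 10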
Hope this helps, cheers.
timeit has loops and runs. You just specified the loops per run.
In [80]: alist = []
In [81]: %%timeit
...: alist.append(1)
...:
...:
146 ns ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [82]: len(alist)
Out[82]: 81111111
In [83]: alist=[]
In [84]: %%timeit -n2 -r1
...: alist.append(1)
...:
...:
1.75 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)
In [85]: len(alist)
Out[85]: 2
In [86]: alist=[]
In [87]: %%timeit -n2
...: alist.append(1)
...:
...:
993 ns ± 129 ns per loop (mean ± std. dev. of 7 runs, 2 loops each)
In [88]: len(alist)
Out[88]: 14
Generally I try to set up a timeit so I don't care what the "result" is, since I want the times, not some sort of accumulated list or array.
For a fresh list each run:
In [89]: %%timeit -n2 -r10 alist=[]
...: alist.append(1)
...:
...:
The slowest run took 4.75 times longer than the fastest. This could mean that an intermediate result is being cached.
1.21 µs ± 889 ns per loop (mean ± std. dev. of 10 runs, 2 loops each)
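The same idea with the timeit module directly, where the setup string plays the role of the statement on the %%timeit line (a minimal sketch):
import timeit

# The setup string runs once per sample (-r), so each of the 10 samples
# times 2 appends onto a fresh list, as with `%%timeit -n2 -r10 alist=[]`.
times = timeit.repeat('alist.append(1)', setup='alist = []',
                      repeat=10, number=2)
print(len(times))  # 10 per-sample timings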

Efficiently implementing Matlab's "Find" function in Julia

I'm trying to implement Matlab's Find function in Julia. In Matlab, the code is
find(A==0)
where A is a very, very large n by m matrix, and where I iterate and update the above over a series of about 500 steps. In Julia, I implement the above via
[findall(x->x==0, D_tot)[j][2] for j in 1:count(x->x==0,D_tot)]
This seems to work nicely, except it gets very slow as I progress through the iteration. For example, for the first step, @time yields
0.000432 seconds (33 allocations: 3.141 KiB)
Step 25:
0.546958 seconds (40.37 k allocations: 389.997 MiB, 7.40% gc time)
Step 65:
1.765892 seconds (86.73 k allocations: 1.516 GiB, 9.63% gc time)
At each step, A remains the same size but becomes more complex, and Julia seems to have trouble finding the zeroes. Is there a better way of implementing Matlab's "Find" function than what I did above?
Going through the Matlab documentation I understand that you want to find
"a vector containing the linear indices of each nonzero element in array X"
and by non-zero you mean the true values of Matlab's expression A==0
In that case this can be accomplished as
findall(==(0),vec(D_tot))
And a small benchmark:
D_tot=rand(0:100,1000,1000)
using BenchmarkTools
Running:
julia> @btime findall(==(0), vec($D_tot));
615.100 μs (17 allocations: 256.80 KiB)
julia> @btime findall(iszero, vec($D_tot));
665.799 μs (17 allocations: 256.80 KiB)
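As an aside for readers more at home in NumPy (not part of the Julia answer), the same "linear indices of the zeros of the flattened array" operation looks like this; note that NumPy flattens row-major while Julia and MATLAB are column-major:
import numpy as np

D_tot = np.random.randint(0, 101, size=(1000, 1000))
# Analogous to findall(==(0), vec(D_tot)) in Julia or find(A == 0) in MATLAB.
idx = np.flatnonzero(D_tot == 0)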

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following values for FAR/FRR. I want to compute the EER and then plot it in Matlab.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any Matlab function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question
1) For your data EER can be the mean/max/min of [19.64,20]
1.1) The idea of EER is to measure a system's performance against another system (the lower the better) by finding the point where the False Acceptance Rate (FAR) and the False Rejection Rate (FRR, or miss rate) are equal (or, if never exactly equal, at least as close as possible, i.e. with minimum distance).
Referring to your data, [19.64, 20] gives the minimum distance, so it could be used as the EER. You can take the mean/max/min of these two values; however, since the EER is meant to compare systems, make sure the other systems use the same method (mean/max/min) to pick their EER value.
The difference between mean/max/min can be ignored if there is a large amount of data. In some speaker verification tasks there are 100k data samples.
2) To understand the EER, it is better to compute it yourself. Here is how:
two things you need to know:
A) The system score for each test case (trial)
B) The true/false for each trial
After you have A and B, create [trial, score, true/false] tuples, sort them by the score value, and then loop through the scores, e.g. from min to max. At each step treat that score as the threshold and compute the FAR and FRR. After looping through all the scores, find the FAR and FRR that are (most nearly) equal; a minimal sketch of this loop is given at the end of this answer.
For the code you can refer to my pyeer.py, in the function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
4) There are other measures similar to EER, such as minDCF, which only changes the weights of FAR and FRR. You can refer to the "Performance Measure" section of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting EER in Matlab
Plotting in DETWare_v2.1
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
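As referenced in point 2 above, here is a minimal NumPy sketch of that threshold sweep (illustrative only; it is not the pyeer.py implementation, and it assumes a similarity score, i.e. higher means accept):
import numpy as np

def compute_eer(scores, labels):
    # labels: 1/True for target (genuine) trials, 0/False for impostor trials.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.sort(scores)
    # FAR: fraction of impostor trials accepted; FRR: fraction of targets rejected.
    far = np.array([np.mean(scores[~labels] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels] < t) for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))          # closest-to-equal point
    return (far[i] + frr[i]) / 2, thresholds[i]    # EER estimate and its threshold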
FAR(t) and FRR(t) are parameterized by a threshold t. They are cumulative distributions, so they should be monotonic in t. Your data as listed is not monotonic, so if it is indeed FAR and FRR, then the measurements were not listed in threshold order. For the sake of clarity, we can sort them:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the crossover value between rows 10 and 11. This is your equal error rate.
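A minimal sketch of that interpolation step (Python, purely illustrative; only the four rows around the crossover are used):
import numpy as np

# FAR/FRR pairs from rows 9-12 of the sorted table above.
far = np.array([19.14, 19.64, 21.29, 24.92])
frr = np.array([20.28, 20.00, 18.61, 17.08])
i = np.argmax(far > frr)              # first row where FAR exceeds FRR ("row 11")
d0 = frr[i - 1] - far[i - 1]          # FRR - FAR gap just before the crossover
d1 = far[i] - frr[i]                  # FAR - FRR gap just after the crossover
t = d0 / (d0 + d1)                    # fractional position of the crossover
eer = far[i - 1] + t * (far[i] - far[i - 1])
print(eer)                            # ~19.8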

MATLAB repeat numbers based on a vector of lengths

Is there a vectorised way to do the following? (shown by an example):
input_lengths = [ 1 1 1 4 3 2 1 ]
result = [ 1 2 3 4 4 4 4 5 5 5 6 6 7 ]
I have spaced out the input_lengths so it is easy to understand how the result is obtained
The resulting vector has length sum(input_lengths). I currently calculate result using the following loop:
result = ones(1, sum(input_lengths));
counter = 1;
for i = 1:length(input_lengths)
    start_index = counter;
    end_index = counter + input_lengths(i) - 1;
    result(start_index:end_index) = i;
    counter = end_index + 1;
end
EDIT:
I can also do this using arrayfun (although that is not exactly a vectorised function)
cell_result = arrayfun(@(x) repmat(x, 1, input_lengths(x)), 1:length(input_lengths), 'UniformOutput', false);
cell_result : {[1], [2], [3], [4 4 4 4], [5 5 5], [6 6], [7]}
result = [cell_result{:}];
result : [ 1 2 3 4 4 4 4 5 5 5 6 6 7 ]
A fully vectorized version:
selector = bsxfun(@le, [1:max(input_lengths)]', input_lengths);
V = repmat([1:size(selector,2)], size(selector,1), 1);
result = V(selector);
The downside is that the memory usage is O(numel(input_lengths)*max(input_lengths)).
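For comparison, the same masking idea in NumPy (an aside, not part of the MATLAB answer; the transposes emulate MATLAB's column-major V(selector) indexing):
import numpy as np

input_lengths = np.array([1, 1, 1, 4, 3, 2, 1])
# selector[r, c] is True for the first input_lengths[c] rows of column c.
selector = np.arange(input_lengths.max())[:, None] < input_lengths
V = np.tile(np.arange(1, len(input_lengths) + 1), (input_lengths.max(), 1))
result = V.T[selector.T]              # column-major traversal, as in MATLAB
print(result)                         # [1 2 3 4 4 4 4 5 5 5 6 6 7]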
Benchmark of all solutions
Following the previous benchmark, I grouped all the solutions given here into a script and ran it for a few hours as a benchmark. I did this because I think it's worth seeing how each proposed solution performs as a function of the input length - my intention is not to put down the quality of the previous benchmark, which adds information about the effect of the JIT. Moreover, and every participant seems to agree with this, quite good work was done in all the answers, so this great post deserves a concluding post.
I won't post the code of the script here; it is quite long and rather uninteresting. The procedure of the benchmark is to run each solution for a set of different input-vector lengths: 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000. For each input length, I generated a random input vector based on a Poisson law with parameter 0.8 (to avoid big values):
input_lengths = round(-log(1-rand(1,ILen(i)))/poisson_alpha)+1;
Finally, I average the computation times over 100 runs per input length.
I've run the script on my laptop computer (core I7) with Matlab R2013b; JIT is activated.
And here are the plotted results (sorry, color lines), in a log-log scale (x-axis: input length; y-axis: computation time in seconds):
So Luis Mendo is the clear winner, congrats!
For anyone who wants the numerical results and/or wants to replot them, here they are (cut the table into 2 parts and approximated to 3 digits, for a better display):
N 10 20 50 100 200 500 1e+03 2e+03
-------------------------------------------------------------------------------------------------------------
OP's for-loop 8.02e-05 0.000133 0.00029 0.00036 0.000581 0.00137 0.00248 0.00542
OP's arrayfun 0.00072 0.00117 0.00255 0.00326 0.00514 0.0124 0.0222 0.047
Daniel 0.000132 0.000132 0.000148 0.000118 0.000126 0.000325 0.000397 0.000651
Divakar 0.00012 0.000114 0.000132 0.000106 0.000115 0.000292 0.000367 0.000641
David's for-loop 9.15e-05 0.000149 0.000322 0.00041 0.000654 0.00157 0.00275 0.00622
David's arrayfun 0.00052 0.000761 0.00152 0.00188 0.0029 0.00689 0.0122 0.0272
Luis Mendo 4.15e-05 4.37e-05 4.66e-05 3.49e-05 3.36e-05 4.37e-05 5.87e-05 0.000108
Bentoy13's cumsum 0.000104 0.000107 0.000111 7.9e-05 7.19e-05 8.69e-05 0.000102 0.000165
Bentoy13's sparse 8.9e-05 8.82e-05 9.23e-05 6.78e-05 6.44e-05 8.61e-05 0.000114 0.0002
Luis Mendo's optim. 3.99e-05 3.96e-05 4.08e-05 4.3e-05 4.61e-05 5.86e-05 7.66e-05 0.000111
N 5e+03 1e+04 2e+04 5e+04 1e+05 2e+05 5e+05 1e+06
-------------------------------------------------------------------------------------------------------------
OP's for-loop 0.0138 0.0278 0.0588 0.16 0.264 0.525 1.35 2.73
OP's arrayfun 0.118 0.239 0.533 1.46 2.42 4.83 12.2 24.8
Daniel 0.00105 0.0021 0.00461 0.0138 0.0242 0.0504 0.126 0.264
Divakar 0.00127 0.00284 0.00655 0.0203 0.0335 0.0684 0.185 0.396
David's for-loop 0.015 0.0286 0.065 0.175 0.3 0.605 1.56 3.16
David's arrayfun 0.0668 0.129 0.299 0.803 1.33 2.64 6.76 13.6
Luis Mendo 0.000236 0.000446 0.000863 0.00221 0.0049 0.0118 0.0299 0.0637
Bentoy13's cumsum 0.000318 0.000638 0.00107 0.00261 0.00498 0.0114 0.0283 0.0526
Bentoy13's sparse 0.000414 0.000774 0.00148 0.00451 0.00814 0.0191 0.0441 0.0877
Luis Mendo's optim. 0.000224 0.000413 0.000754 0.00207 0.00353 0.00832 0.0216 0.0441
OK, I've added another solution to the list ... I couldn't stop myself from optimizing the best-so-far solution from Luis Mendo. No credit for that, it's just a variant of Luis Mendo's; I'll explain it later.
Clearly, the solutions using arrayfun are very time-consuming. The solutions using an explicit for loop are faster, yet still slow compared with the other solutions. So yes, vectorizing is still a major option for optimizing a Matlab script.
Since I've seen a big dispersion in the computing times of the fastest solutions, especially with input lengths between 100 and 10000, I decided to benchmark more precisely. So I set the slowest ones apart (sorry) and redid the benchmark over the 6 other solutions, which run much faster. The second benchmark over this reduced list of solutions is identical except that I averaged over 1000 runs.
(No table here, unless you really want to, it's quite the same numbers as before)
As was remarked, the solution by Daniel is a little faster than the one by Divakar because it seems that using bsxfun with @times is slower than using repmat. Still, they are 10 times faster than the for-loop solutions: clearly, vectorizing in Matlab is a good thing.
The solutions of Bentoy13 and Luis Mendo are very close; the first one uses more instructions, but the second one uses an extra allocation when concatenating 1 to cumsum(input_lengths(1:end-1)). And that's why we see that Bentoy13's solution tends to be a bit faster for big input lengths (above 5*10^5): there is no extra allocation. Based on this consideration, I've made an optimized solution with no extra allocation; here is the code (Luis Mendo can put this one in his answer if he wants to :) ):
result = zeros(1,sum(input_lengths));
result(1) = 1;
result(1+cumsum(input_lengths(1:end-1))) = 1;
result = cumsum(result);
Any comment for improvement is welcome.
More of a comment than anything, but I did some tests. I tried a for loop and an arrayfun, and I tested your for-loop and arrayfun versions. Your for loop was the fastest. I think this is because it is simple, which allows JIT compilation to do the most optimisation. I am using Matlab; Octave might be different.
And the timing:
Solution: With JIT Without JIT
Sam for 0.74 1.22
Sam arrayfun 2.85 2.85
My for 0.62 2.57
My arrayfun 1.27 3.81
Divakar 0.26 0.28
Bentoy 0.07 0.06
Daniel 0.15 0.16
Luis Mendo 0.07 0.06
So Bentoy's code is really fast, and Luis Mendo's is almost exactly the same speed. And I rely on JIT way too much!
And the code for my attempts
clc, clear
input_lengths = randi(20, [1 10000]);
% My for loop
tic()
C = cumsum(input_lengths);
D = diff(C);
results = zeros(1, C(end));
results(1, 1:C(1)) = 1;
for i = 2:length(input_lengths)
    results(1, C(i-1)+1:C(i)) = i*ones(1, D(i-1));
end
toc()
% My arrayfun
tic()
A = arrayfun(@(i) i*ones(1, input_lengths(i)), 1:length(input_lengths), 'UniformOutput', false);
R = [A{:}];
toc()
result = zeros(1,sum(input_lengths));
result(cumsum([1 input_lengths(1:end-1)])) = 1;
result = cumsum(result);
This should be pretty fast. And memory usage is the minimum possible.
An optimized version of the above code, due to Bentoy13 (see his very detailed benchmarking):
result = zeros(1,sum(input_lengths));
result(1) = 1;
result(1+cumsum(input_lengths(1:end-1))) = 1;
result = cumsum(result);
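For readers coming from Python, the same trick translates directly to NumPy (a sketch; np.repeat(np.arange(1, len(input_lengths) + 1), input_lengths) would also produce the result in a single call):
import numpy as np

input_lengths = np.array([1, 1, 1, 4, 3, 2, 1])
# Put a 1 at the (0-based) start index of each run, then cumulative-sum.
result = np.zeros(input_lengths.sum(), dtype=int)
result[np.cumsum(np.r_[0, input_lengths[:-1]])] = 1
result = np.cumsum(result)
print(result)                         # [1 2 3 4 4 4 4 5 5 5 6 6 7]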
This is a slight variant of @Daniel's answer; the crux of this solution is based on that one. This one avoids repmat, so in that way it's a little more "vectorized", maybe. Here's the code -
selector = bsxfun(@le, [1:max(input_lengths)]', input_lengths);
V = bsxfun(@times, selector, 1:numel(input_lengths));
result = V(V~=0)
For all the desperate one-liner searching people -
result = nonzeros(bsxfun(@times, bsxfun(@le, [1:max(input_lengths)]', input_lengths), 1:numel(input_lengths)))
I searched for an elegant solution, and I think David's solution is a good start. The idea is that one can generate the indices at which the value increases by one relative to the previous element.
For that, if we compute the cumsum of the input vector, we get:
cumsum(input_lengths)
ans = 1 2 3 7 10 12 13
These are the indices of the ends of the sequences of identical numbers. That is not what we want, so we flip the vector twice to get the beginnings:
fliplr(sum(input_lengths)+1-cumsum(fliplr(input_lengths)))
ans = 1 2 3 4 8 11 13
Here is the trick. You flip the vector, cumsum it to get the ends of the flipped vector, and then flip back; but you must subtract the result from the total length of the output vector (+1 because indexing starts at 1), because the cumsum was applied to the flipped vector.
Once you have done this, it's very straightforward, you just have to put 1 at computed indexes and 0 elsewhere, and cumsum it:
idx_begs = fliplr(sum(input_lengths)+1-cumsum(fliplr(input_lengths)));
result = zeros(1,sum(input_lengths));
result(idx_begs) = 1;
result = cumsum(result);
EDIT
First, please have a look at Luis Mendo's solution; it is very close to mine but simpler and a bit faster (I won't edit mine even though they are very close). I think at this date it is the fastest solution of all.
Second, while looking at the other solutions, I've come up with another one-liner, a little different from my initial solution and from the other one-liner. OK, this won't be very readable, so take a breath:
result = cumsum( full(sparse(cumsum([1,input_lengths(1:end-1)]), ...
ones(1,length(input_lengths)), 1, sum(input_lengths),1)) );
I cut it into two lines. OK, now let's explain it.
The similar part is building the array of the indices at which to increment the value of the current element; I use Luis Mendo's solution for that. To build the solution vector in one line, I use the fact that it is really a sparse representation of the binary vector, the one we cumsum at the very end. This sparse vector is built using our computed index vector as the x positions, a vector of ones as the y positions, and 1 as the value to put at these locations. A fourth argument is given to specify the total size of the vector (important if the last element of input_lengths is not 1). Then we take the full representation of this sparse vector (otherwise the result would be a sparse vector with no empty elements) and we can cumsum.
There is no use for this solution other than to provide yet another approach to the problem. A benchmark shows that it is slower than my original solution, because of a heavier memory load.