Is the complexity of kdb's moving max function mmax O(n)? - kdb

I used the function mmax to calculate the moving max of a 10-million-length integer vector. I ran it 10 times and measured the total execution time. The running time for window size 132 (15,025 milliseconds) is about 6 times longer than for window size 22 (2,425 milliseconds). It seems the complexity of mmax is O(nw) rather than O(n), where w is the length of the sliding window.
To check whether this is true for other similar products, I tried the same experiment on DolphinDB, a time series database with built-in analytics features (https://www.dolphindb.com/downloads.html). In contrast, DolphinDB's mmax has linear complexity O(n) regardless of the window size: 1,277 milliseconds (window size 132) and 1,233 milliseconds (window size 22).
The hardware being used for this experiment:
Server: Dell PowerEdge R630
Architecture: x86_64
CPU Model Name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Total logical CPU cores: 48
Total memory: 256G
Experiment setup
I used the KDB+ 4.0 64-bit version and DolphinDB_Linux_V2.00.7 (DolphinDB community version: 2 cores and 8 GB memory). Both experiments were conducted using 2 CPU cores.
KDB implementation
// Start the server
rlwrap -r taskset -c 0,1 ./l64/q -p 5002 -s 2
// The code
a:10000000?10000i
\t do[10; 22 mmax a]
2425
\t do[10; 132 mmax a]
15025
DolphinDB implementation
// Start the server
rlwrap -r ./dolphindb -localSite localhost:5002:local5002 -localExecutors 1
// The code
a=rand(10000,10000000)
timer(10) mmax(a,22);
1232.83 ms
timer(10) mmax(a,132);
1276.53 ms
Can any kdb expert confirm the complexity of the function mmax? If the built-in mmax does have O(nw) complexity, is there any third-party plugin for kdb that improves the performance?

Yes, it would scale with the size of the window as well as the size of the list. If you look at the definition of mmax:
q)mmax
k){(x-1)|':/y}
it is "equivalent" to
q)a:8 1 9 5 4 6 6 1 8 5
q)w:3
q)mmax[w;a]~{{max x,y}':[x]}/[w-1;a]
1b
which can more clearly be understood as the last output of:
q){{max x,y}':[x]}\[w-1;a]
8 1 9 5 4 6 6 1 8 5
8 8 9 9 5 6 6 6 8 8
8 8 9 9 9 6 6 6 8 8
....take the max of each item with its previous item, {max x,y}':[x]
....then do that same operation again on the output, {}\
....do the same operation again on the output (w-1) times, \[w-1;a]
From that it's clear that the window size impacts the number of times the loop is performed. As for a faster implementation, there might be a different but less "elegant" algorithm which does it quicker and could be written in k/q. Otherwise you could import an implementation written in C - see https://code.kx.com/q/ref/dynamic-load/
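For reference, the standard O(n) approach to a moving maximum keeps a monotonic deque of candidate maxima, so every element is pushed and popped at most once and the cost is independent of the window size. The sketch below is in Python purely to illustrate the algorithm (it is not kdb's or DolphinDB's actual implementation); the same idea could be written in C and loaded into q via the dynamic-load mechanism linked above.
from collections import deque

def moving_max(xs, w):
    # Sliding-window maximum in O(n): each index enters and leaves the deque at most once.
    out = []
    dq = deque()  # indices of candidate maxima; their values are kept in decreasing order
    for i, x in enumerate(xs):
        while dq and xs[dq[-1]] <= x:   # drop candidates that can never be the max again
            dq.pop()
        dq.append(i)
        if dq[0] <= i - w:              # drop the front candidate once it leaves the window
            dq.popleft()
        out.append(xs[dq[0]])
    return out

# moving_max([8, 1, 9, 5, 4, 6, 6, 1, 8, 5], 3) returns [8, 8, 9, 9, 9, 6, 6, 6, 8, 8],
# matching q's 3 mmax 8 1 9 5 4 6 6 1 8 5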

Related

Matlab: I/O Delay detection

I have a continuous process with 3 inputs and 1 output. The 3 inputs are consecutive in time: the output lags Input 1 by 30 minutes, Input 2 by 15, etc.
My dataset below shows a startup for the system after a shutdown:
I1 I2 I3 Out
0 0 0 0
3 0 0 0
8 4 0 0
13 8 6 0
22 13 9 3.2
It can be seen how input1 started and everything else followed.
My question: in Matlab, what should I look for in order to determine such I/O delay for more complex datasets?
You should take a close look at xcorr.
xcorr computes the cross-correlation between two vectors (typically time signals), i.e. how well they match as a function of the time shift between them. A constant I/O lag should show up as a local maximum of the correlation coefficient at the corresponding lag.
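For illustration, here is a minimal Python/numpy sketch of the same idea; Matlab's [c, lags] = xcorr(y, x) gives the equivalent correlation sequence and lag axis, and the lag at which the correlation peaks is the estimated I/O delay. The function name and toy signals below are mine, not from the question.
import numpy as np

def estimate_lag(inp, out):
    # Estimate by how many samples `out` lags `inp` via cross-correlation.
    inp = np.asarray(inp, dtype=float) - np.mean(inp)
    out = np.asarray(out, dtype=float) - np.mean(out)
    c = np.correlate(out, inp, mode="full")       # correlation for every possible shift
    lags = np.arange(-(len(inp) - 1), len(out))   # lag axis matching mode="full"
    return lags[np.argmax(c)]

# toy example: the output is the input delayed by 3 samples
x = np.sin(np.linspace(0, 20, 200))
y = np.roll(x, 3)
print(estimate_lag(x, y))   # -> 3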

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following FAR/FRR values. I want to compute the EER and then plot it in Matlab.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any Matlab function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question
1) For your data, the EER can be the mean/max/min of [19.64, 20].
1.1) The idea of EER is to measure a system's performance against another system (the lower the better) by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal (or, if never exactly equal, as close to each other as possible).
Referring to your data, [19.64, 20] gives the minimum distance, so it can be used as the EER; you can take the mean/max/min of these two values. However, since the EER is meant for comparing systems, make sure the other systems use the same method (mean/max/min) to pick their EER value.
The difference among mean/max/min can be ignored when there is a large amount of data. In some speaker verification tasks there are around 100k data samples.
2) To understand the EER, it is better to compute it yourself. Here is how.
Two things you need to know:
A) The system score for each test case (trial)
B) The true/false label for each trial
Once you have A and B, create [trial, score, true/false] tuples and sort them by score. Then loop through the scores, e.g. from min to max; at each step, take that score as the threshold and compute the FAR and FRR. After the loop, find the point where FAR and FRR are (nearly) "equal".
For the code you can refer to my pyeer.py, in the function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
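If you want a self-contained version of that procedure, here is a minimal Python sketch (a simplified stand-in, not the pyeer.py linked above), assuming a similarity score where a higher value means "accept" and that both genuine and impostor trials are present:
def compute_eer(scores, labels):
    # scores: system score per trial; labels: True for genuine trials, False for impostor trials.
    # Sweep the threshold over the sorted scores and return (EER, threshold) where |FAR - FRR| is smallest.
    pairs = sorted(zip(scores, labels))
    n_true = sum(labels)
    n_false = len(labels) - n_true
    best = (2.0, None, None)                         # (|FAR - FRR|, EER, threshold)
    rejected_true = rejected_false = 0
    for score, is_true in pairs:
        # threshold = score: everything with a lower score has already been rejected
        far = (n_false - rejected_false) / n_false   # impostors still accepted
        frr = rejected_true / n_true                 # genuine trials already rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2, score)
        if is_true:
            rejected_true += 1
        else:
            rejected_false += 1
    return best[1], best[2]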
4) There are other measures similar to the EER, such as minDCF, which simply assigns different weights to FAR and FRR. You can refer to the "Performance Measure" section of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to the package https://sites.google.com/site/bosaristoolkit/ and to DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting the EER in Matlab.
Plotting in DETWare_v2.1
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
FAR(t) and FRR(t) are parameterized by the threshold t. They are cumulative distributions, so they should be monotonic in t. Your data as listed is not monotonic, so if it really is FAR and FRR, the measurements were simply not listed in threshold order. For the sake of clarity, we can sort them:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it becomes larger than FRR, which occurs at row 11. Then interpolate the crossover value between rows 10 and 11. This is your equal error rate.
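A small Python sketch of that last step, interpolating the crossover of the two curves from the sorted table above (values in percent; the function and variable names are mine):
def eer_from_far_frr(far, frr):
    # far must be increasing and frr decreasing, as in the sorted table above.
    for i in range(1, len(far)):
        if far[i] >= frr[i]:                   # first row where FAR overtakes FRR
            d_prev = frr[i - 1] - far[i - 1]   # how far FRR is above FAR before the crossing
            d_here = far[i] - frr[i]           # how far FAR is above FRR after the crossing
            t = d_prev / (d_prev + d_here)     # fractional position of the crossing
            return far[i - 1] + t * (far[i] - far[i - 1])
    raise ValueError("the curves do not cross")

far = [1.65, 1.82, 3.96, 7.92, 13.20, 15.35, 16.83, 17.99, 19.14, 19.64,
       21.29, 24.92, 26.07, 29.04, 34.49, 40.76, 50.33, 66.83, 82.51]
frr = [98.33, 84.31, 60.56, 42.92, 29.17, 26.39, 23.47, 21.39, 20.28, 20.00,
       18.61, 17.08, 16.39, 13.13, 9.31, 6.81, 5.42, 1.67, 0.28]
print(eer_from_far_frr(far, frr))   # crosses between rows 10 and 11, roughly 19.8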

How does Matlab extract a subset of a bigger matrix by specifying the indices?

I have a matrix A
A =
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
I have another matrix to specify the indices:
index =
1 2
3 4
Now I have obtained a third matrix C from A:
C = A(index)
C =
1 6
11 16
Problem: I am unable to understand how I got such a matrix C. I mean, what is the logic behind it?
The logic behind it is linear indexing: when you provide a single index, Matlab counts down the rows of the first column, then down the second column, and so on (column-major order), then along further dimensions (according to their order).
So in your case (4 x 5 matrix) the entries of A are being accessed in the following order (each number here represents order, not the value of the entry):
1 5 9 13 17
2 6 10 14 18
3 7 11 15 19
4 8 12 16 20
Once you get used to it, you'll see linear indexing is a very powerful tool.
As an example: to obtain the maximum value in A you could just use max(A(1:20)). This can be further simplified to max(A(1:end)) or max(A(:)). Note that A(:) is a common Matlab idiom used to turn any array into a column vector, which is sometimes called linearizing the array.
See also ind2sub and sub2ind, which are used to convert from linear index to standard indices and vice versa.
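If it helps to see the same convention outside Matlab, here is a small numpy sketch; "F" (Fortran) order is the same column-major layout Matlab uses, np.unravel_index / np.ravel_multi_index play the roles of ind2sub and sub2ind, and numpy indices are 0-based rather than 1-based:
import numpy as np

A = np.arange(1, 21).reshape(4, 5)      # the same 4 x 5 matrix as above
index = np.array([[1, 2],
                  [3, 4]])

# Matlab's A(index) uses column-major linear indexing; subtract 1 for 0-based numpy
C = A.flatten(order="F")[index - 1]
print(C)                                # [[ 1  6]
                                        #  [11 16]]

print(np.unravel_index(2, A.shape, order="F"))            # (2, 0): Matlab linear index 3 -> row 3, column 1
print(np.ravel_multi_index((2, 0), A.shape, order="F"))   # 2: and back again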

MATLAB: Step through iterations of a vector

Hi all.
I have a 15-element array: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15].
I was wondering if there is a command that would step through the permutations of the array without repeating itself. In other words, since there is a chance that randperm() will produce the same permutation twice, I want to step through each permutation exactly once and perform a calculation.
I concede that there are factorial(15) permutations, but for my purposes, these two vectors (and similar) are identical and don't need to be counted twice:
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
[15 14 13 12 11 10 9 8 7 6 5 4 3 2 1]
Thus, is there any way to step through this?
Thanks.
I think what you are looking for is perms. randperm returns a single random permutation; you want all the permutations.
So use
my_permutations = perms(1:15);
If forward-backward is the same as backward-forward, you can then drop the reversed duplicates. Taking the top half of the list only works in special cases; a filter that always keeps exactly one of each reversed pair is the one used in the answer below:
my_permutation_to_use = my_permutations(my_permutations(:,1) < my_permutations(:,end), :);
You could compare each new permutation against all previous ones, but that would require storing all past permutations. A local decision is better. I recommend this simple rule:
A permutation is valid, if the first element is smaller than the last element.
A permutation is redundant, if the first element is larger than the last element.
For small sizes, this could simply be done with this code:
%generate all permutations
x=perms(1:10)
%select only the valid lines, remove all redundant lines
x(x(:,1)<x(:,end),:)
The remaining problem is that generating x for 1:15 breaks all memory limits and would take about 100 hours.
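If the full perms matrix does not fit in memory, one option (outside Matlab) is to generate the permutations lazily and apply the same first-element-smaller-than-last filter on the fly, for example with Python's itertools. This is only a sketch of the idea; note that 15!/2 permutations are still far too many to enumerate in practice, so the memory problem turns into a pure runtime problem.
from itertools import permutations

def unique_up_to_reversal(items):
    # Yield permutations one at a time, keeping only those whose first element is
    # smaller than the last, so exactly one of each reversed pair survives.
    for p in permutations(items):
        if p[0] < p[-1]:
            yield p

# small example; for range(1, 16) there would be 15!/2 results
for p in unique_up_to_reversal(range(1, 5)):
    print(p)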

Setting up an ANN to classify Tic-Tac-Toe End-Games

I'm having a hard time setting up a neural network to classify Tic-Tac-Toe board states (final or intermediate) as "X wins", "O wins" or "Tie".
I will describe my current solution and results. Any advice is appreciated.
* DATA SET *
Dataset = 958 possible end-games + 958 random-games = 1916 board states
(random games might be incomplete but are all legal, i.e. they do not have both players winning simultaneously).
Training set = 1600 random sample of Dataset
Test set = remaining 316 cases
In my current pseudo-random development scenario the dataset has the following characteristics.
Training set:
- 527 wins for "X"
- 264 wins for "O"
- 809 ties
Test set:
- 104 wins for "X"
- 56 wins for "O"
- 156 ties
* Modulation *
Input Layer: 18 input neurons, where each one corresponds to a board position and a player (a small sketch of this encoding is given after this section). Therefore,
the board (B=blank):
x x o
o x B
B o x
is encoded as:
1 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0
Output Layer: 3 output neurons which correspond to each outcome (X wins, O wins, Tie).
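For concreteness, a small Python sketch of the encoding just described (the helper name is mine): each cell contributes two inputs, one set to 1 if X occupies it and one set to 1 if O does.
def encode_board(board):
    # board: string of the 9 cells in row order, using 'x', 'o' or 'B' for blank.
    # Returns the 18-element input vector, (is_x, is_o) per cell.
    inputs = []
    for cell in board:
        inputs += [1 if cell == 'x' else 0, 1 if cell == 'o' else 0]
    return inputs

# the board from above, read row by row:
print(encode_board("xxooxBBox"))
# [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]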
* Architecture *
Based on: http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz
1 Single Hidden Layer
Hidden Layer activation function: Logistic
Output Layer activation function: Softmax
Error function: Cross-Entropy
* Results *
No combination of parameters seems to achieve 100% correct classification rate. Some examples:
NHidden LRate InitW MaxEpoch Epochs FMom Errors TestErrors
8 0.0025 0.01 10000 4500 0.8 0 7
16 0.0025 0.01 10000 2800 0.8 0 5
16 0.0025 0.1 5000 1000 0.8 0 4
16 0.0025 0.5 5000 5000 0.8 3 5
16 0.0025 0.25 5000 1000 0.8 0 5
16 0.005 0.25 5000 1000 0.9 10 5
16 0.005 0.25 5000 5000 0.8 15 5
16 0.0025 0.25 5000 1000 0.8 0 5
32 0.0025 0.25 5000 1500 0.8 0 5
32 0.0025 0.5 5000 600 0.9 0 5
8 0.0025 0.25 5000 3500 0.8 0 5
Important - I am aware that any of the following could be improved:
- The dataset characteristics (source and quantities of training and test cases) aren't the best.
- An alternative problem modulation (encoding of the input/output neurons) might be more suitable.
- A better network architecture (number of hidden layers, activation/error functions, etc.) might exist.
Assuming that my current options in these regards, even if not optimal, should not prevent the system from reaching a 100% correct classification rate, I would like to focus on other possible issues.
In other words, considering the simplicity of the game, this dataset/modulation/architecture should do it; so what am I doing wrong regarding the parameters?
I do not have much experience with ANNs, and my main question is the following:
Using 16 hidden neurons, the ANN could learn to associate each hidden unit with "a certain player winning in a certain way":
(3 different rows + 3 different columns + 2 diagonals) * 2 players = 16
In this setting, an "optimal" set of weights is pretty straightforward: each hidden unit has "greater" connection weights from 3 of the input units (corresponding to a row, column or diagonal of one player) and a "greater" connection weight to one of the output units (corresponding to "a win" by that player).
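To make that intuition concrete, here is a hypothetical Python sketch of such a hand-crafted weight pattern, assuming the per-cell (X, O) input ordering sketched earlier; the weight value 5.0 is arbitrary, and biases and the "Tie" output are omitted:
import numpy as np

# The 8 winning lines of the 3 x 3 board, as row-major cell indices
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),     # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),     # columns
         (0, 4, 8), (2, 4, 6)]                # diagonals

W1 = np.zeros((18, 16))   # input -> hidden: one hidden unit per (line, player)
W2 = np.zeros((16, 3))    # hidden -> output: (X wins, O wins, Tie)
for player in range(2):                       # 0 = X inputs, 1 = O inputs
    for l, line in enumerate(LINES):
        h = player * 8 + l
        for cell in line:
            W1[2 * cell + player, h] = 5.0    # "greater" weights from that player's 3 cells
        W2[h, player] = 5.0                   # and a "greater" weight to "that player wins"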
No matter what I do, I cannot decrease the number of test errors, as the above table shows.
Any advice is appreciated.
You are doing everything right, but you're simply trying to tackle a difficult problem here, namely to generalize from some examples of tic-tac-toe configurations to all others.
Unfortunately, the simple neural network you use does not perceive the spatial structure of the input (neighborhood), nor can it exploit the symmetries. So in order to get zero test error, you can either:
increase the size of the dataset to include most (or all) possible configurations -- which the network will then be able to simply memorize, as indicated by the zero training error in most of your setups;
choose a different problem, where there is more structure to generalize from;
use a network architecture that can capture symmetry (e.g. through weight-sharing) and/or spatial relations of the inputs (e.g. different features). Convolutional networks are just one example of this; a data-side alternative is sketched below.
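An alternative to weight-sharing that keeps the architecture unchanged is to augment the training set with the 8 symmetric variants (4 rotations, each optionally mirrored) of every board, so the plain network at least sees the symmetries in the data. A small Python sketch of generating those variants (the helper names are mine):
def rotate90(board):
    # Rotate a 9-cell row-major board 90 degrees clockwise.
    return tuple(board[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))

def mirror(board):
    # Mirror a 9-cell row-major board left-right.
    return tuple(board[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))

def symmetries(board):
    # All 8 symmetric variants of a board; each keeps the same label (winner).
    out, b = set(), tuple(board)
    for _ in range(4):
        b = rotate90(b)
        out.add(b)
        out.add(mirror(b))
    return out

# every (board, label) training pair can be replaced by up to 8 pairs with the same label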