KDB+/Q: Custom min max scaler - kdb

I'm trying to implement a custom min-max scaler in kdb+/q. I have taken note of the implementation in the ml package, but I'm looking to scale data to a custom range, e.g. between 0 and 255. What would be an efficient implementation of min-max scaling in kdb+/q?
Thanks

Looking at the GitHub link on the page you referenced, it looks like you may be able to define a function like so:
minmax255:{[sf;x]sf*(x-mnx)%max[x]-mnx:min x}[255]
Where sf is your scaling factor (here given by 255).
q)minmax255 til 10
0 28.33333 56.66667 85 113.3333 141.6667 170 198.3333 226.6667 255
If you don't like decimals you could round to the nearest whole number like:
q)minmax255round:{[sf;x]floor 0.5+sf*(x-mnx)%max[x]-mnx:min x}[255]
q)minmax255round til 10
0 28 57 85 113 142 170 198 227 255
(logic here is if I have a number like 1.7, add .5, and floor I'll wind up with 2, whereas if I had a number like 1.2, add .5, and floor I'll end up with 1)
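For readers who want to check the arithmetic outside q, here is a minimal Python sketch of the same scale-then-round idea. The function name `minmax_round` and its `sf` parameter are illustrative choices that mirror the q code above; they are not part of any library.

```python
import math

def minmax_round(xs, sf=255):
    """Scale xs linearly onto 0..sf, then round half up via floor(v + 0.5)."""
    xs = list(xs)
    lo, hi = min(xs), max(xs)
    span = hi - lo  # assumption: xs is not constant, otherwise this divides by zero
    return [math.floor(0.5 + sf * (x - lo) / span) for x in xs]

print(minmax_round(range(10)))
```

Adding 0.5 before flooring rounds halves upward, which is the same trick the q one-liner uses.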
If you don't want to start at 0 you could use |, which takes the max of its left and right arguments:
q)minmax255roundlb:{[sf;lb;x]lb|floor sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmax255roundlb til 10
10 28 56 85 113 141 170 198 226 255
Where I'm using lb to mean 'lower bound'
If you want to apply this to a table you could use
q)show testtab:([]a:til 10;b:til 10)
a b
---
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
q)update minmax255 a from testtab
a b
----------
0 0
28.33333 1
56.66667 2
85 3
113.3333 4
141.6667 5
170 6
198.3333 7
226.6667 8
255 9

The following will work nicely
minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
As petty as it sounds, I strongly recommend that you do not follow Shehir94's solution for a custom minimum value. Applying a maximum to set the start of the range distorts the original distribution. A custom min-max scaling should be a simple linear transformation of a standard 0-1 min-max transformation.
X' = a + bX
For example, a custom scaling of 10-255 corresponds to b=245 and a=10. We would expect the new mean to follow this formula and the standard deviation to change only by the multiplicative factor, but applying a lower bound breaks this, for example:
q)dummyData:10000?100.0
q)stats:{`transform`minVal`maxVal`avgVal`stdDev!(x;min y;max y; avg y; dev y)}
q)minmax255roundlb:{[sf;lb;x]lb|sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
q)res:stats'[`orig`lb`linear;(dummyData;minmax255roundlb dummyData;minmaxCustom[10;255;dummyData])]
q)res
transform minVal     maxVal   avgVal   stdDev
-----------------------------------------------
orig      0.02741043 99.98293 50.21896 28.92852
lb        10         255      128.2518 73.45999
linear    10         255      133.024  70.9064
// The transformed average should roughly be
q)10 + ((255-10)%100)*49.97936
132.4494
// The transformed std deviation should roughly be
q)2.45*28.92852
70.87487
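The same check can be sketched in Python with only the standard library. The helper names (`unit_scale`, `linear_scale`, `clamped_scale`) and the random seed are assumptions made for illustration; the point is only that the linear map moves the mean to a + b*mean and the standard deviation to b*stdev, while the clamped version does neither.

```python
import random
import statistics

def unit_scale(xs):
    """Standard 0-1 min-max scaling."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def linear_scale(xs, l, u):
    """Linear map of the 0-1 scaling onto [l, u]: X' = l + (u - l) * X."""
    return [l + (u - l) * z for z in unit_scale(xs)]

def clamped_scale(xs, sf, lb):
    """Scale onto [0, sf], then force a lower bound by taking the max with lb."""
    return [max(lb, sf * z) for z in unit_scale(xs)]

random.seed(0)  # arbitrary seed, only so the run is repeatable
data = [random.uniform(0.0, 100.0) for _ in range(10000)]
unit = unit_scale(data)
linear = linear_scale(data, 10, 255)
clamped = clamped_scale(data, 255, 10)

# Linear map: both pairs printed below should agree.
print(statistics.fmean(linear), 10 + 245 * statistics.fmean(unit))
print(statistics.pstdev(linear), 245 * statistics.pstdev(unit))
# Clamped map: values below the bound pile up at 10, so the mean is pulled upward.
print(statistics.fmean(clamped), 255 * statistics.fmean(unit))
```

`statistics.pstdev` is the population standard deviation, which is what q's `dev` computes.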
To answer the comment, this could be applied over a large number of columns of a table in the following manner:
q)n:10000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?100;col3:n?1000.0)
q)multiColApply:{[tab;scaler;colList]flip ft,((),colList)!((),scaler each (ft:flip tab)[colList])}
q)multiColApply[tab;minmaxCustom[10;20];`col1`col2]
sym col1 col2 col3
------------------------------
cag 13.78461 10.60606 392.7524
goo 15.26201 16.76768 517.0911
eoh 14.05111 19.59596 515.9796
kbc 13.37695 19.49495 406.6642
mdc 10.65973 12.52525 178.0839
odn 16.24697 17.37374 301.7723
ioj 15.08372 15.05051 785.033
mbc 16.7268 20 534.7096
bhj 12.95134 18.38384 711.1716
gnf 19.36005 15.35354 411.597
gnd 13.21948 18.08081 493.1835
khi 12.11997 17.27273 578.5203
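As a rough illustration of the same pattern outside q, the sketch below treats a table as a dictionary of column lists and applies a scaler to a chosen subset of columns. The names `multi_col_apply` and `minmax_custom` are reused from the q code for readability; the dictionary representation and the toy data are assumptions.

```python
def minmax_custom(l, u, xs):
    """Linear min-max scaling of xs onto [l, u]."""
    lo, hi = min(xs), max(xs)
    return [l + (u - l) * (x - lo) / (hi - lo) for x in xs]

def multi_col_apply(table, scaler, cols):
    """Return a copy of table (dict of column lists) with scaler applied to cols."""
    if isinstance(cols, str):  # accept a single column name as well as a list
        cols = [cols]
    out = dict(table)          # shallow copy, untouched columns are shared
    for c in cols:
        out[c] = scaler(table[c])
    return out

table = {"sym": ["a", "b", "c"], "col1": [0.0, 50.0, 100.0], "col2": [1, 2, 5]}
scaled = multi_col_apply(table, lambda xs: minmax_custom(10, 20, xs), ["col1", "col2"])
print(scaled)
```

Each column is scaled against its own minimum and maximum, and columns that are not listed pass through unchanged.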

KDB/Q How to implement moving rank efficiently?

I am trying to implement a moving rank function, taking parameters of n, the number of items, and m, the column name. Here is how I implement it:
mwindow: k){[y;x]$[y>0;x#(!#x)+\:!y;x#(!#x)+\:(!-y)+y+1]};
mrank: {[n;x] sum each x > prev mwindow[neg n;x]};
But this seems to take quite some time if n is moderately large, say 100.
I figure it is because it has to calculate from scratch, unlike msum, which keeps a running variable and only calculates the difference between the newly added and the dropped items.
There's a number of general sliding window functions here that you can use to generate rolling lists on which to apply your rank: https://code.kx.com/q/kb/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
Those approaches seem to fill the lists out with zeros/nulls however which I think won't really suit your use of rank. Here's another possible approach which might be more suitable to rank (though I haven't tested this for performance on the large scale):
q)mwin:{x each (),/:{neg[x]sublist y,z}[y]\[z]}
q)update r:mwin[rank;4;c] from ([]c:10?100)
c r
----------
84 ,0
25 1 0
31 2 0 1
0 3 1 2 0
51 1 2 0 3
29 2 0 3 1
25 0 3 2 1
73 2 1 0 3
0 2 1 3 0
6 2 3 0 1
q)update r:last each mwin[rank;4;c] from ([]c:10?100)
c r
----
38 0
72 1
13 0
77 3
64 1
9 0
37 1
79 3
97 3
63 1
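To illustrate the incremental idea raised in the question (update the window rather than recompute it), here is a hedged Python sketch that keeps the trailing window in sorted order and reads off the rank of the newest item with a binary search. The function name `moving_rank` and the tie rule (equal earlier items rank below the newest one, which is how I understand q's `rank` to behave) are assumptions; it has not been benchmarked against the q versions.

```python
from bisect import bisect_left, bisect_right, insort
from collections import deque

def moving_rank(values, n):
    """Rank of each item within the trailing window of at most n items (0-based)."""
    window = deque()   # items in arrival order, used to know what to drop
    ordered = []       # the same items kept sorted
    ranks = []
    for v in values:
        if len(window) == n:                        # make room for the new item
            old = window.popleft()
            del ordered[bisect_left(ordered, old)]  # remove one copy of the oldest item
        ranks.append(bisect_right(ordered, v))      # earlier items <= v sit below it
        window.append(v)
        insort(ordered, v)
    return ranks

print(moving_rank([38, 72, 13, 77, 64, 9, 37, 79, 97, 63], 4))
```

Each step costs a binary search plus a list insertion and deletion instead of a full re-rank of the window, which is where the saving for larger n would come from.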

Calculating the weighted moving average of 2 lists using a set window

If I have two lists:
a:1 2 3 4;
b:10 20 30 40;
I want to sum the product of the two lists within a window of 2. So the result set should be:
10 50 130 250
For example, to get the result of 130 it would be (2*20)+(3*30) = 130
sums 2 mavg '(a*b)
seems to get me part way there, but the window of 2 isn't being applied. I've tried experimenting with sum, sums, sum each, wavg, mavg, etc. and I am completely stuck. Could anyone help? Thanks!
This line should work for you:
2 msum a*b
as demonstrated here:
q)a:1 2 3 4
q)b:10 20 30 40
q)2 msum a*b
10 50 130 250
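For comparison, a small Python sketch of what a moving sum over the element-wise products does: it keeps a running total and subtracts the product that falls out of the window. The function name `moving_sum_of_products` is made up for this example.

```python
from collections import deque

def moving_sum_of_products(a, b, window):
    """Sum of a[i]*b[i] over the trailing `window` items, for each position."""
    out = []
    recent = deque()  # products currently inside the window
    total = 0
    for x, y in zip(a, b):
        p = x * y
        recent.append(p)
        total += p
        if len(recent) > window:       # drop the product that left the window
            total -= recent.popleft()
        out.append(total)
    return out

print(moving_sum_of_products([1, 2, 3, 4], [10, 20, 30, 40], 2))
```

The first position has only one product available, so it contributes 10 on its own, matching the q output.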
For more information about the keyword msum, you could check out the Kx Reference page:
https://code.kx.com/wiki/Reference/msum
Hope that helps!
Alternatively you could use the adverb each prior:
q)+':[a*b]
However, this only works with a window size of 2, and if your data contains null values they need to be filled with 0 first:
q)+':[0^a*b2]
On a positive note it is faster than using msum in this situation.
q)\ts:1000000 +':[0^a*b2]
940 1264
q)\ts:1000000 2 msum a*b2
1556 1104

Clustering data after linkage algorithm

I am not an expert in statistics and data analysis, hence I can't understand if the behavior which I obtain is correct or not. I am here looking for your help.
Assume I have these samples which I would like to cluster (10 points in the plane - reduced version of the problem):
[X Y] =
266 450
266 400
258 168
290 442
295 438
273 432
294 158
318 161
250 423
253 413
To cluster them I can use a cluster tree
Z = linkage([ X Y ],'complete');
which is (by dendrogram(Z,10))
Now I would like to extract clusters on the basis of the distance attached to the nodes of the tree.
Say that my distance is 150, I would expect that the call
T = cluster(Z,'Cutoff',150);
returns me 2 clusters. But it gives me just one (I suppose), i.e.
T =
1
1
1
1
1
1
1
1
1
1
What am I missing?
Use inconsistent(Z,150) and look at the values in column 4. Increasing the cutoff from a small positive number steps you along the tree.
E.g.
cluster(Z,'cutoff',0.7)
does not give you what you want (I think)
but
cluster(Z,'cutoff',0.8)
does.
The criterion for cluster is inconsistency ('inconsistent') by default.
Since the height in dendrogram is distance, you can change the criterion to 'distance',
i.e:
T = cluster(Z, 'Cutoff', 150, 'criterion', 'distance');
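To make the distance criterion concrete, here is a small pure-Python sketch of complete-linkage clustering that stops merging once the closest pair of clusters is at least `cutoff` apart. It is not MATLAB's implementation; the function name `complete_linkage_clusters` and the strict "less than cutoff" rule are assumptions chosen for illustration.

```python
from math import dist

def complete_linkage_clusters(points, cutoff):
    """Merge clusters while their complete-linkage distance is below cutoff."""
    clusters = [[i] for i in range(len(points))]

    def gap(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = gap(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d >= cutoff:        # the next merge would exceed the cutoff height
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels

pts = [(266, 450), (266, 400), (258, 168), (290, 442), (295, 438),
       (273, 432), (294, 158), (318, 161), (250, 423), (253, 413)]
labels = complete_linkage_clusters(pts, 150)
print(labels)
```

With the ten points from the question, the three points with Y near 160 end up in one cluster and the remaining seven in another, which is the two-cluster split the question expected at a height of 150.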

Matlab - Sum of surrounding elements

I want to calculate the sum of the elements surrounding a given element in a matrix. So far, I have written these lines of code:
for i=1:m,
rij(1:n)=0
for j=1:n,
alive = tijdelijk(i-1,j)+tijdelijk(i+1,j)+tijdelijk(i-1,j-1)+tijdelijk(i+1,j-1)+tijdelijk(i,j+1)+tijdelijk(i,j-1)+tijdelijk(i-1,j+1)+tijdelijk(i+1,j+1)
This results in an error because, for example, i-1 becomes zero for i=1. Anyone got an idea how to do this without getting this error?
You can sum the elements via filtering. conv2 can be used for this manner.
Let me give an example. I create a sample matrix
>> A = reshape(1:20, 4, 5)
A =
1 5 9 13 17
2 6 10 14 18
3 7 11 15 19
4 8 12 16 20
Then, I create a filter. The filter is like a mask where you put the center on the current cell and the locations corresponding to the 1's on the filter are summed. For eight-connected neighbor case, the filter should be as follows:
>> B = [1 1 1; 1 0 1; 1 1 1]
B =
1 1 1
1 0 1
1 1 1
Then, you simply convolve the matrix with this small matrix.
>> conv2(A, B, 'same')
ans =
13 28 48 68 45
22 48 80 112 78
27 56 88 120 83
18 37 57 77 50
If you want four-connected neighbors, you can make the corners of your filter 0. Similarly, you can design any filter for your purpose, such as for averaging all neighbors instead of summing them.
For details, please see the convolution article in Wikipedia.
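What the convolution computes can be spelled out with explicit loops. The Python sketch below sums the eight neighbours of every cell and simply skips positions that fall outside the matrix, which corresponds to zero padding. The function name `neighbour_sum` is an illustrative choice.

```python
def neighbour_sum(grid):
    """Sum of the eight surrounding cells; out-of-range neighbours count as 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue               # skip the centre cell itself
                    r, c = i + di, j + dj
                    if 0 <= r < rows and 0 <= c < cols:
                        total += grid[r][c]
            out[i][j] = total
    return out

# Same matrix as reshape(1:20, 4, 5) above, written row by row.
A = [[1, 5, 9, 13, 17],
     [2, 6, 10, 14, 18],
     [3, 7, 11, 15, 19],
     [4, 8, 12, 16, 20]]
for row in neighbour_sum(A):
    print(row)
```

The bounds check replaces the index error from the original loop, at the cost of treating everything beyond the edge as zero.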
Two possibilities : change the limits of the loops to i=k:(m-k) and j=k:(n-k) or use blkproc
ex :
% compute the 2-D DCT of each 8-by-8 block
I = imread('cameraman.tif');
fun = @dct2;
J = blkproc(I,[8 8],fun);
imagesc(J), colormap(hot)
There are lots of things you can do at the edges. Which you choose depends very specifically on your problem and differs from use case to use case. Typical things to do:
If (i-1) or (i+1) is out of range, then just ignore that element. This is equivalent to padding the matrix with zeros around the outside and adjusting the loop limits accordingly.
Wrap around the edges. In other words, for an MxN matrix, if (i-1) takes you to 0 then instead of taking element (i-1, j) = (0, j) you take element (M, j).
Since your code mentions "your teacher" I'd guess that you can ask what should happen at the edges (or working it out in a sensible manner may well be part of the task!!).
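The wrap-around option can be sketched with modular indexing. This Python example is only an illustration; the function name `neighbour_sum_wrap` is made up, and it assumes the matrix is at least 3 by 3 so that no neighbour is counted twice.

```python
def neighbour_sum_wrap(grid):
    """Sum of the eight surrounding cells, wrapping around the matrix edges."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue  # skip the centre cell itself
                    # The modulo maps index -1 to the last row/column and
                    # index rows (or cols) back to 0.
                    total += grid[(i + di) % rows][(j + dj) % cols]
            out[i][j] = total
    return out

A = [[1, 5, 9, 13, 17],
     [2, 6, 10, 14, 18],
     [3, 7, 11, 15, 19],
     [4, 8, 12, 16, 20]]
wrapped = neighbour_sum_wrap(A)
print(wrapped[0][0])
```

Because every cell is a neighbour of exactly eight others under wrapping, the grand total of the result is eight times the sum of the input, which is a quick sanity check.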

find and replace zeros with a function in matlab

Once again, sorry if this has been asked before and if it's too specific, but I'm very stuck and can't quite find a solution.
I have a matrix of say 3 members of a structure called 2, 4 and 16 (in column 1) that have values along their relative distance e.g. member 2 has values at the start, 0m, then at 0.5m then the end of its length 1.5m, where member 4 starts at 0m etc. So that my matrix looks like this:
2 0 125
2 0.5 25
2 1.5 365
4 0 25
4 0.6 57
16 0 354
16 0.2 95
16 0.8 2
and I want to create a matrix that has the overall distance along all the members 2, 4 and 16 combined:
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 3.1 2
is there any way to do this in matlab? Like possibly locating the first zero and adding the value above it to all the rest of the values below then find the next zero value and so on?
Please tell me if this isn't clear, I realise it's a bit confusing but not too sure how to explain it better!
I came up with the following:
idx = find(diff(M(:,1)));
v = zeros(size(M,1),1);
v(idx+1) = M(idx,2);
M(:,2) = M(:,2) + cumsum(v);
The result:
M =
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 2.9 2
Note the last value in the second column disagrees with what you described (2.9 vs 3.1). Either you had a typo, or I'm still not getting it...
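The same running-offset logic, written out step by step in Python for readers who find the `diff`/`cumsum` version terse. The function name `cumulative_distance` and the tuple layout (member, distance, value) are assumptions for this sketch.

```python
def cumulative_distance(rows):
    """Shift each member's distances by the total length of the members before it."""
    out = []
    offset = 0.0
    prev_member = None
    prev_dist = 0.0
    for member, d, value in rows:
        if prev_member is not None and member != prev_member:
            # A new member starts: add the last raw distance of the previous one.
            offset += prev_dist
        out.append((member, d + offset, value))
        prev_member, prev_dist = member, d
    return out

M = [(2, 0, 125), (2, 0.5, 25), (2, 1.5, 365),
     (4, 0, 25), (4, 0.6, 57),
     (16, 0, 354), (16, 0.2, 95), (16, 0.8, 2)]
for row in cumulative_distance(M):
    print(row)
```

It detects a new member by a change in the first column rather than by looking for a zero distance, so it also copes with a member whose first reading is not at 0.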
data = [2 0 125;
2 0.5 25;
2 1.5 365;
4 0 25;
4 0.6 57;
16 0 354;
16 0.2 95;
16 0.8 2];
idx0 = find(data(:,2)==0);
idx0 = idx0(2:end); %ignore first zero of first member, doesn't need an offset
offset = data(idx0-1,2);
N = size(data,1);
for ii=1:numel(idx0)
idxs = 1:N>=idx0(ii);
data(idxs,2) = data(idxs,2) + offset(ii);
end