How can one compute a weighted median in KDB?
I can see that there is a function med for a simple median, but I could not find something like wmed, analogous to wavg.
Thank you very much for your help!
For values v and weights w, med v where w works when the weights are whole-number counts, but it gobbles space for larger values of w.
Instead, sort w into ascending order of v and look for where cumulative sums reach half their sum.
q)show v:10?100
17 23 12 66 36 37 44 28 20 30
q)show w:.001*10?1000
0.418 0.126 0.077 0.829 0.503 0.12 0.71 0.506 0.804 0.012
q)med v where "j"$w*1000
36f
q)w iasc v / sort w into ascending order of v
0.077 0.418 0.804 0.126 0.506 0.012 0.503 0.12 0.71 0.829
q)0.5 1*(sum;sums)@\:w iasc v / half the sum and cumulative sums of w
2.0525
0.077 0.495 1.299 1.425 1.931 1.943 2.446 2.566 3.276 4.105
q).[>]0.5 1*(sum;sums)@\:w iasc v / compared
1111110000b
q)v i sum .[>]0.5 1*(sum;sums)@\:w i:iasc v / weighted median
36
q)\ts:1000 med v where "j"$w*1000
18 132192
q)\ts:1000 v i sum .[>]0.5 1*(sum;sums)@\:w i:iasc v
2 2576
q)wmed:{x i sum .[>]0.5 1*(sum;sums)@\:y i:iasc x}
Some vector techniques worth noticing:
Applying two functions with Each Left, (sum;sums)@\:, and using Apply . with an operator on the result, rather than setting a variable, e.g. (0.5*sum yi)>sums yi:y i, or defining an inner lambda, e.g. {sums[x]<0.5*sum x}y i
Grading one list with iasc to sort another
Multiple mappings through juxtaposition: v i sum ..
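For readers who want to check the logic outside q, here is a rough Python sketch of the same steps (grade by value, cumulate the weights, count how many cumulative sums fall below half the total). The function name wmed mirrors the q definition above; the Python itself is only an illustration and not part of the original answer.

```python
from itertools import accumulate

def wmed(values, weights):
    """Weighted median: the value at which cumulative weight reaches half the total."""
    # Indices that would sort the values ascending (the role of iasc in q).
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_weights = [weights[i] for i in order]
    half = 0.5 * sum(sorted_weights)
    # Count the cumulative sums that are still below half the total weight.
    below = sum(half > c for c in accumulate(sorted_weights))
    return values[order[below]]

v = [17, 23, 12, 66, 36, 37, 44, 28, 20, 30]
w = [0.418, 0.126, 0.077, 0.829, 0.503, 0.12, 0.71, 0.506, 0.804, 0.012]
print(wmed(v, w))  # 36, matching the q session above
```

This assumes at least one positive weight; with an all-zero weight vector the index would run past the end.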
You could effectively weight the median by duplicating (using where):
q)med 10 34 23 123 5 56 where 4 1 1 1 1 1
10f
q)med 10 34 23 123 5 56 where 1 1 1 1 1 4
56f
q)med 10 34 23 123 5 56 where 1 2 1 3 2 1
34f
If your weights are percentages (e.g. 0.15 0.10 0.20 0.30 0.25) then convert them to equivalent whole (counting) numbers first:
q)med 1 2 3 4 5 where "i"$100*0.15 0.10 0.20 0.30 0.25
4f
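The duplication idea is easy to cross-check in Python. This is just a sketch: med_by_duplication is a made-up helper name, and statistics.median stands in for q's med.

```python
from statistics import median

def med_by_duplication(values, counts):
    """Repeat each value counts[i] times (like q's where), then take the median."""
    expanded = [v for v, c in zip(values, counts) for _ in range(c)]
    return median(expanded)

vals = [10, 34, 23, 123, 5, 56]
print(med_by_duplication(vals, [4, 1, 1, 1, 1, 1]))  # 10
print(med_by_duplication(vals, [1, 1, 1, 1, 1, 4]))  # 56

# Percent weights are converted to whole counts first, as in the "i"$100*... cast.
pct = [0.15, 0.10, 0.20, 0.30, 0.25]
counts = [round(100 * p) for p in pct]
print(med_by_duplication([1, 2, 3, 4, 5], counts))   # 4.0
```

The memory cost grows with the sum of the counts, which is why the cumulative-sum approach is preferable for large or fractional weights.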
Let's say I have this problem and want to solve it using DIMACS and MaxSAT solvers.
There are 10 police patrols and I want the solver to pick the best patrol to send to an intervention. Each patrol is described by 3 variables (status, distance, district),
so there will be 3 groups of clauses.
For example, the first patrol will be PP1 = x1,x11,x21, then PP2 = x2,x12,x22, PP3 = x3,x13,x23, ..., PP10 = x10,x20,x30
Group 1 describes police patrol status (300 is the weight):
300 C1 - (x1 v x2 v x3)
50 C2 - (x4 v x5)
10 C3 - (x6 v x7 v x8 v x9 v x10)
C1 means their status is the best and C3 means it's the worst
group 2 describing police patrol distance to some incident or crime happening
300 C4 - (x11 v x12 v x13)
50 C5 - (x14 v x15)
10 C6 - (x16 v x17 v x18 v x19 v x20 )
C4 means they are the closest to incident, in C6 they are farthest
group 3 describing in what district they are
300 C7 - (x21 v x22 v x23)
50 C8 - (x24 v x25)
10 C9 - (x26 v x27 v x28 v x29 v x30)
C7 will be the safest etc.
So this is my wcnf file in DIMACS format. I don't know if it's good, but I would be pleased if you could point out what's wrong with it.
p wcnf 30 9
300 1 2 3 0
50 4 5 0
10 6 7 8 9 10 0
300 20 11 14 0
50 15 16 17 0
10 12 13 18 19 0
300 29 21 27 0
50 22 23 24 25 0
10 26 28 30 0
I tested it in 2 solvers, the RC2 MaxSAT solver and EvalMaxSAT, and the output was like this:
EvalMaxSAT
s OPTIMUM FOUND
o 0
v -1 2 -3 4 -5 6 -7 -8 -9 -10 11 12 -13 -14 15 -16 -17 -18 -19 -20 21 22 -23 -24 -25 26 -27 -28 -29 -30
c Total time: 335 µs
-
rc2
c formula: 30 vars, 0 hard, 9 soft
s OPTIMUM FOUND
o 0
v 1 -2 -3 4 -5 6 -7 -8 -9 -10 11 12 -13 -14 15 -16 -17 -18 -19 -20 21 22 -23 -24 -25 26 -27 -28 -29 -30
But looking at my wcnf file, I think the ideal output should set 1, 11 and 21 to true, because they are in the clauses with the highest weight.
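One way to read the "o 0" lines: a MaxSAT solver minimises the total weight of falsified soft clauses, so any assignment that satisfies all nine clauses has cost 0 and is equally optimal, whichever variable it uses inside each clause. The small Python sketch below recomputes that cost for the two reported models and for an assignment with only 1, 11 and 21 true. parse_wcnf and cost are hypothetical helper names written for this post, not part of RC2 or EvalMaxSAT.

```python
WCNF = """
p wcnf 30 9
300 1 2 3 0
50 4 5 0
10 6 7 8 9 10 0
300 20 11 14 0
50 15 16 17 0
10 12 13 18 19 0
300 29 21 27 0
50 22 23 24 25 0
10 26 28 30 0
"""

def parse_wcnf(text):
    """Return a list of (weight, literals) pairs, skipping comment and header lines."""
    clauses = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens or tokens[0] in ("c", "p"):
            continue
        weight, *lits = map(int, tokens)
        clauses.append((weight, lits[:-1]))  # drop the terminating 0
    return clauses

def cost(clauses, true_vars):
    """Sum of the weights of the soft clauses falsified by the assignment."""
    def satisfied(lits):
        return any((lit > 0) == (abs(lit) in true_vars) for lit in lits)
    return sum(w for w, lits in clauses if not satisfied(lits))

clauses = parse_wcnf(WCNF)
evalmaxsat_model = {2, 4, 6, 11, 12, 15, 21, 22, 26}  # positive literals from the v line
rc2_model = {1, 4, 6, 11, 12, 15, 21, 22, 26}
print(cost(clauses, evalmaxsat_model), cost(clauses, rc2_model))  # 0 0
print(cost(clauses, {1, 11, 21}))  # 180: six clauses are left unsatisfied
```

Since every clause here contains only positive literals and nothing forbids setting several variables to true, the encoding as posted does not force the solver to choose a single patrol.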
I'm trying to implement a custom min max scaler in kdb+/q. I have taken note of the implementation located in the ml package; however, I'm looking to be able to scale data to a custom range, e.g. 0 to 255. What would be an efficient implementation of min max scaling in kdb+/q?
Thanks
Looking at the link to github on the page you referenced it looks like you may be able to define a function like so:
minmax255:{[sf;x]sf*(x-mnx)%max[x]-mnx:min x}[255]
Where sf is your scaling factor (here given by 255).
q)minmax255 til 10
0 28.33333 56.66667 85 113.3333 141.6667 170 198.3333 226.6667 255
If you don't like decimals you could round to the nearest whole number like:
q)minmax255round:{[sf;x]floor 0.5+sf*(x-mnx)%max[x]-mnx:min x}[255]
q)minmax255round til 10
0 28 57 85 113 142 170 198 227 255
(logic here is if I have a number like 1.7, add .5, and floor I'll wind up with 2, whereas if I had a number like 1.2, add .5, and floor I'll end up with 1)
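The same scale-then-round logic, sketched in Python as a cross-check (minmax_round is a made-up name, and the sketch assumes the input is not constant, otherwise the denominator is zero):

```python
import math

def minmax_round(xs, sf=255):
    """Scale xs to [0, sf], then round half up via floor(0.5 + value)."""
    lo, hi = min(xs), max(xs)
    return [math.floor(0.5 + sf * (x - lo) / (hi - lo)) for x in xs]

print(minmax_round(list(range(10))))
# [0, 28, 57, 85, 113, 142, 170, 198, 227, 255]
```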
If you don't want to start at 0 you could use |, which takes the max of its left and right arguments
q)minmax255roundlb:{[sf;lb;x]lb|floor sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmax255roundlb til 10
10 28 56 85 113 141 170 198 226 255
Where I'm using lb to mean 'lower bound'
If you want to apply this to a table you could use
q)show testtab:([]a:til 10;b:til 10)
a b
---
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
q)update minmax255 a from testtab
a b
----------
0 0
28.33333 1
56.66667 2
85 3
113.3333 4
141.6667 5
170 6
198.3333 7
226.6667 8
255 9
The following will work nicely
minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
As petty as it sounds, I strongly recommend that you do not follow Shehir94's solution for a custom minimum value. Applying a maximum to set the start of the range distorts the original distribution. A custom min-max scaling should be a simple linear transformation of the standard 0-1 min-max transformation:
X' = a + bX
For example, a custom scaling of 10-255 corresponds to b=245 and a=10. We would expect the new mean to follow this formula and the standard deviation to change only by the multiplicative factor b, but applying a lower bound breaks both, for example:
q)dummyData:10000?100.0
q)stats:{`transform`minVal`maxVal`avgVal`stdDev!(x;min y;max y; avg y; dev y)}
q)minmax255roundlb:{[sf;lb;x]lb|sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
q)res:stats'[`orig`lb`linear;(dummyData;minmax255roundlb dummyData;minmaxCustom[10;255;dummyData])]
q)res
transform minVal maxVal avgVal stdDev
-----------------------------------------------
orig 0.02741043 99.98293 50.21896 28.92852
lb 10 255 128.2518 73.45999
linear 10 255 133.024 70.9064
// The transformed average should roughly be
q)10 + ((255-10)%100)*49.97936
132.4494
// The transformed std deviation should roughly be
q)2.45*28.92852
70.87487
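To see the same effect on deterministic data, here is a small Python sketch. The names minmax_custom and minmax_clamped are mine and only mirror the two q functions; pstdev is used because q's dev is the population standard deviation.

```python
from statistics import fmean, pstdev

def minmax_custom(xs, l, u):
    """Linear rescale of xs onto [l, u]."""
    lo, hi = min(xs), max(xs)
    return [l + (u - l) * (x - lo) / (hi - lo) for x in xs]

def minmax_clamped(xs, sf, lb):
    """Rescale onto [0, sf], then force every value up to at least lb."""
    lo, hi = min(xs), max(xs)
    return [max(lb, sf * (x - lo) / (hi - lo)) for x in xs]

data = [float(i) for i in range(101)]        # evenly spaced 0..100
unit = minmax_custom(data, 0.0, 1.0)         # standard 0-1 scaling
linear = minmax_custom(data, 10.0, 255.0)
clamped = minmax_clamped(data, 255.0, 10.0)

# The linear version obeys X' = a + bX with a=10, b=245.
print(fmean(linear), 10 + 245 * fmean(unit))
print(pstdev(linear), 245 * pstdev(unit))
# The clamped version hits the same min and max but its mean drifts.
print(min(clamped), max(clamped), fmean(clamped))
```

Both versions report the same minimum and maximum, which is why the distortion is easy to miss if you only check the range.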
To answer the comment, this could be applied over a large number of columns of a table in the following manner:
q)n:10000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?100;col3:n?1000.0) / col2 and col3 are assumed here; the output below shows them
q)multiColApply:{[tab;scaler;colList]flip ft,((),colList)!((),scaler each (ft:flip tab)[colList])}
q)multiColApply[tab;minmaxCustom[10;20];`col1`col2]
sym col1 col2 col3
------------------------------
cag 13.78461 10.60606 392.7524
goo 15.26201 16.76768 517.0911
eoh 14.05111 19.59596 515.9796
kbc 13.37695 19.49495 406.6642
mdc 10.65973 12.52525 178.0839
odn 16.24697 17.37374 301.7723
ioj 15.08372 15.05051 785.033
mbc 16.7268 20 534.7096
bhj 12.95134 18.38384 711.1716
gnf 19.36005 15.35354 411.597
gnd 13.21948 18.08081 493.1835
khi 12.11997 17.27273 578.5203
When using glmfit in matlab, there are different problem setups that can be used:
x = [2100 2300 2500 2700 2900 3100 ...
3300 3500 3700 3900 4100 4300]';
n = [48 42 31 34 31 21 23 23 21 16 17 21]';
y = [1 2 0 3 8 8 14 17 19 15 17 21]';
[b dev] = glmfit(x,[y n],'binomial','link','probit');
Here they fit numerical data where n is the number of items tested, and y is the number of successes.
X = meas(51:end,:);
y = strcmp('versicolor',species(51:end));
b = glmfit(X,y,'binomial','link','logit')
In this case the y variable is binary and no n value is required (is that correct?)
In my case I have data on greyhound races.
For each race I have a dummy variable (y) that takes the value one when the dog wins and zero otherwise.
Q1.) For this setup I should use this formulation, correct (with no n value supplied)?
[b dev] = glmfit(X,y,'binomial','link','logit')
Q2.) What is the precise definition of dev? The documentation says it is a generalization of the residual sum of squares, but does not define it precisely.
Thanks
I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First off, take a look at the distribution of your data. You can see that the majority of your data has double digits. The outliers are those with single digits, or those that are way larger than double digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. The missing data are the empty slots between consecutive commas. Let's write some MATLAB code to change these to NaN (not-a-number): if you copy and paste the definition directly into MATLAB it gives a syntax error, because when you define numbers explicitly this way, all of them have to be there.
To do this, use regexprep to insert a NaN wherever the string has a comma, a space, then another comma. We first need to write the statement as a string; we then use eval to convert the string into an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN, and the values that are single digits or far larger than double digits are the outliers.
For the next question, we simply extract those values that are not missing, calculate the mean and median of what is not missing, and fill in those NaN values with the mean and median. For the bin boundaries, this is the same thing as using the values to the left (or right... depends on your definition, but let's use left) of the missing value and fill those in. As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
The mean and median of the non-missing values are:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values, look like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.
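If you want to sanity-check the numbers outside MATLAB, the three fills can be sketched in plain Python, with None playing the role of NaN. The helper names fill and fill_previous are invented for this sketch.

```python
from statistics import mean, median

raw = [1, 48, 81, 2, 10, 25, None, 14, 18, 53, 41, 56, 89, 0, 1000, None,
       34, 47, 455, 21, None, 22, 100]

present = [x for x in raw if x is not None]   # the non-missing values
mean_y = mean(present)                        # 105.85
median_y = median(present)                    # 37.5

def fill(data, value):
    """Replace every missing entry with one constant value."""
    return [value if x is None else x for x in data]

def fill_previous(data):
    """Replace each missing entry with the value to its left."""
    out = []
    for x in data:
        # out[-1] raises IndexError if the very first entry is missing.
        out.append(out[-1] if x is None else x)
    return out

y_mean = fill(raw, mean_y)
y_median = fill(raw, median_y)
y_bin_bound = fill_previous(raw)
print(y_mean[6], y_median[6], y_bin_bound[6])  # 105.85 37.5 25
```

One difference from the MATLAB line above: for two missing values in a row, this sketch carries the last real value forward, whereas y(find(yMissing)-1) would copy a NaN.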
Once again, sorry if this has been asked before and if it's too specific, but I'm very stuck and can't quite find a solution.
I have a matrix of say 3 members of a structure called 2, 4 and 16 (in column 1) that have values along their relative distance e.g. member 2 has values at the start, 0m, then at 0.5m then the end of its length 1.5m, where member 4 starts at 0m etc. So that my matrix looks like this:
2 0 125
2 0.5 25
2 1.5 365
4 0 25
4 0.6 57
16 0 354
16 0.2 95
16 0.8 2
and I want to create a matrix that has the overall distance along all the members 2, 4 and 16 combined:
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 3.1 2
Is there any way to do this in MATLAB? Like possibly locating the first zero and adding the value above it to all the rest of the values below, then finding the next zero value and so on?
Please tell me if this isn't clear, I realise it's a bit confusing but not too sure how to explain it better!
I came up with the following:
idx = find(diff(M(:,1)));
v = zeros(size(M,1),1);
v(idx+1) = M(idx,2);
M(:,2) = M(:,2) + cumsum(v);
The result:
M =
2 0 125
2 0.5 25
2 1.5 365
4 1.5 25
4 2.1 57
16 2.1 354
16 2.3 95
16 2.9 2
Note the last value in the second column disagrees with what you described (2.9 vs 3.1). Either you had a typo, or I'm still not getting it...
data = [2 0 125;
2 0.5 25;
2 1.5 365;
4 0 25;
4 0.6 57;
16 0 354;
16 0.2 95;
16 0.8 2];
idx0 = find(data(:,2)==0);
idx0 = idx0(2:end); %ignore first zero of first member, doesn't need an offset
offset = data(idx0-1,2);
N = size(data,1);
for ii=1:numel(idx0)
idxs = 1:N>=idx0(ii);
data(idxs,2) = data(idxs,2) + offset(ii);
end
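Both snippets above do the same thing: each time the member id changes, the last local distance of the previous member is added to a running offset. For comparison, here is that idea as a plain Python loop. cumulative_distance is a hypothetical name, and the sketch assumes the rows are already grouped by member and ordered by distance within each member.

```python
def cumulative_distance(rows):
    """Turn per-member distances into one running distance along all members."""
    out = []
    offset = 0.0
    prev_member, prev_local = None, 0.0
    for member, local, value in rows:
        if prev_member is not None and member != prev_member:
            offset += prev_local      # end of the previous member becomes the new origin
        out.append((member, offset + local, value))
        prev_member, prev_local = member, local
    return out

rows = [(2, 0, 125), (2, 0.5, 25), (2, 1.5, 365),
        (4, 0, 25), (4, 0.6, 57),
        (16, 0, 354), (16, 0.2, 95), (16, 0.8, 2)]

for member, dist, value in cumulative_distance(rows):
    print(member, round(dist, 3), value)
```

It ends at 2.9 as well, in agreement with the two MATLAB versions.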