My intention is to find the class of a test sample using the Bayes classifier algorithm.
Suppose the following training data describes the heights, weights, and foot sizes of people of various sexes:
SEX     HEIGHT (feet)   WEIGHT (lbs)   FOOT-SIZE (inches)
male    6               180            12
male    5.92 (5'11")    190            11
male    5.58 (5'7")     170            12
male    5.92 (5'11")    165            10
female  5               100            6
female  5.5 (5'6")      150            8
female  5.42 (5'5")     130            7
female  5.75 (5'9")     150            9
trans   4               200            5
trans   4.10            150            8
trans   5.42            190            7
trans   5.50            150            9
Now, I want to test a person with the following properties (test data) to find his/her sex:
HEIGHT (feet)   WEIGHT (lbs)   FOOT-SIZE (inches)
4               150            12
This may also be a multi-row matrix.
Suppose I am able to isolate only the male portion of the data and arrange it in a matrix,
and I want to find its Parzen density function against the following row matrix, which represents the same data for another person (male/female/trans),
(dataPoint may have multiple rows.)
so that we can find how closely this data matches those males.
My attempted solution:
(1) I am unable to calculate secPart because of a dimensional mismatch between the matrices. How can I fix this?
(2) Is this approach correct?
MATLAB Code
male = [6.0000   180   12
        5.9200   190   11
        5.5800   170   12
        5.9200   165   10];
dataPoint = [4 150 2];
variance = var(male);
parzen.m
function [retval] = parzen (male, dataPoint, variance)
  sub = male - dataPoint;      % deviation of every male sample from the test point
  up  = sub.^2;
  dw  = 2 * variance;
  sqr = sqrt(variance*2*pi);
  firstPart = sqr.^(-1);
  e = dw.^(-1);
  secPart = exp((-1)*e*up);    % this line fails: e is 1x3 and up is 4x3, so e*up is a dimensional mismatch
  pdf = firstPart .* secPart;
  retval = mean(pdf);
bayes.m
function retval = bayes (train, test, apriori)
  classCounts = rows(unique(train(:,1)));    % rows() is Octave's row-count helper
  for type = 1:classCounts
    clidxTrain  = train(:,1) == type;        % logical index of this class's rows
    trainMatrix = train(clidxTrain, 2:end);
    variance = var(trainMatrix);
    pdf = parzen(trainMatrix, test, variance);
    %dictionary{type, 1} = type;
    %dictionary{type, 2} = prod(pdf);
  end
  retval = 0;                                % class selection not implemented yet
endfunction
First, your example person has a tiny foot!
Second, it seems you are mixing together kernel density estimation and naive Bayes. In a KDE, you estimate a pdf as a sum of kernels, one kernel per data point in your sample. So if you wanted to do a KDE of the height of males, you would add together four Gaussians, each one centered at the height of a different male.
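A minimal sketch of that idea (the bandwidth h is picked by hand here, purely for illustration):
heights = [6.00 5.92 5.58 5.92];   % the four male heights from the training data
h = 0.25;                          % kernel bandwidth, an arbitrary illustrative choice
x = 4:0.01:7;                      % grid on which to evaluate the estimate
kde = zeros(size(x));
for k = 1:numel(heights)           % one Gaussian kernel per data point
    kde = kde + exp(-(x - heights(k)).^2 / (2*h^2)) / (h*sqrt(2*pi));
end
kde = kde / numel(heights);        % average so the estimate integrates to 1
plot(x, kde)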
In naive Bayes, you assume that the features (height, foot size, etc.) are independent and that each one is normally distributed. You estimate the parameters of a single Gaussian per feature from your training data, then use their product to get the joint probability of a new example belonging to a certain class. The first page that you link explains this fairly well.
In code:
clear
human = [6.0000 180 12
5.9200 190 11
5.5800 170 12
5.9200 165 10];
tiger = [
2 2000 17
3 1980 16
3.5 2100 18
3 2020 18
4.1 1800 20
];
dataPoints = [
4 150 12
3 2500 20
];
sigSqH = var(human);    % per-feature variances for the human class
muH = mean(human);      % per-feature means for the human class
sigSqT = var(tiger);
muT = mean(tiger);
for i = 1:size(dataPoints, 1)
    i   % display which test point is being scored
    probHuman = prod( 1./sqrt(2*pi*sigSqH) .* exp( -(dataPoints(i,:) - muH).^2 ./ (2*sigSqH) ) )
    probTiger = prod( 1./sqrt(2*pi*sigSqT) .* exp( -(dataPoints(i,:) - muT).^2 ./ (2*sigSqT) ) )
end
Comparing the probability of tiger vs. human lets us conclude that dataPoints(1,:) is a person while dataPoints(2,:) is a tiger. You can make this model more complicated by, e.g., adding prior probabilities of being one class or the other, which would then multiply probHuman or probTiger.
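For example, with made-up priors of 0.7 human / 0.3 tiger, the comparison becomes:
priorHuman = 0.7;                      % hypothetical class priors, not fitted from data
priorTiger = 0.3;
scoreHuman = priorHuman * probHuman;   % unnormalized posterior scores
scoreTiger = priorTiger * probTiger;   % classify by whichever score is larger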
I want to select a random subset of a vector, much like datasample(data,k), but I want them in order.
I have an ODE which has [t,y] as output, and it's y that I want a subset of. I cannot just sort the sampled values, because y is not monotonic, so I somehow have to keep them ordered with respect to t.
Any ideas how I can do this?
If I understand correctly, you want to sample the elements maintaining their original order. You can do it this way:
randomly sample the indices rather than the values;
sort the sampled indices;
use them to access the selected values;
that is:
result = data(sort(randsample(numel(data), k)));
The above uses the randsample function from the Statistics Toolbox. Alternatively, in recent Matlab versions you can use the two-input form of randperm:
result = data(sort(randperm(numel(data), k)));
For example, given
data = [61 52 43 34 25 16];
k = 4;
a possible result is
result =
61 43 34 25
This can be solved using a combination of randperm and intersect:
function q40673112
% Create a vector:
v = round(sin(0:0.6:6),3); disp(['v = ' mat2str(v)]);
% Set the size of sample we want:
N = 5;
% Create the random indices:
inds = intersect(1:numel(v), randperm(numel(v),N)); disp(['inds = ' mat2str(inds)]);
% Sample from the vector:
v_samp = v(inds); disp(['v_samp = ' mat2str(v_samp)]);
Example output:
% 1 2 3 4 5 6 7 8 9 10 11
v = [0 0.565 0.932 0.974 0.675 0.141 -0.443 -0.872 -0.996 -0.773 -0.279]
inds = [4 6 9 10 11]
v_samp = [0.974 0.141 -0.996 -0.773 -0.279]
I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then I populate a vector with all the possible numbers covered by the bins, from which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (e.g. bincount(1-5) = 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whereas bincount(16-20) = 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bin counts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and data is often found in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
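One way to see this is to draw a large sample and compare the empirical frequencies against the normalized weights (a quick sketch):
sample = randsample(values, 100000, true, weights);
empirical = accumarray(sample(:), 1)' / numel(sample)   % observed frequencies
expected  = weights / sum(weights)                      % target frequencies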
I know that in order to filter a large amount of data in chunks, it is possible to use the function filter with the appropriate filter coefficients,
and pass the final conditions zf of one chunk to the next chunk as its initial conditions zi.
I am confused:
What exactly is the content of zf?
Does it hold the last relevant input samples (in a pure FIR filter)?
The last relevant output samples (in an IIR filter)?
What does it hold when both the last inputs and the last outputs are relevant?
Thanks a lot.
In case we have a large set of data, or we are short on memory, the zf and zi options come in handy.
For example, we can divide our data into two parts, x and newx, and use the filter function like this:
[y,zf] = filter(b,a,x);
newy = filter(b,a,newx,zf);
For a filter with coefficient vectors a and b, as in
a(1)*y[n] = b(1)*x[n] + b(2)*x[n-1] + ... + b(nb)*x[n-nb+1] - a(2)*y[n-1] - ... - a(na)*y[n-na+1]
(with nb = length(b) and na = length(a)), we will be referring back to length(a)-1 samples of y and length(b)-1 samples of x.
So to continue our filter over the second half, we will need max(length(a),length(b))-1 values from the first half.
Example 1
y[n] = x[n] + 2 * x[n-1] + 3 * x[n-2];
which is,
a = 1;
b = [1 2 3];
example input and output are,
x = [1 2 3 4 5 6 7 8 9];
y = [1 4 10 16 22 28 34 40 46];
zf = [42 27]';
Implementing the filter over newx, for the first two samples we have:
newy[1] = newx[1] + 2*9 + 3*8 = newx[1] + 42 = newx[1] + zf[1];
newy[2] = newx[2] + 2 * newx[1] + 3*9 = newx[2] + 2 * newx[1] + zf[2];
Example 2
x = 1 : 9;
b = [1 1 1];
a = [1 2];
[y,zf] = filter(b,a,x);
This corresponds to y[n] = x[n] + x[n-1] + x[n-2] - 2*y[n-1].
The inputs and outputs are:
x = [1 2 3 4 5 6 7 8 9];
y = [1 1 4 1 10 -5 28 -35 94];
zf = [-171 9]';
Now for the first two values of the second half:
newy[1] = newx[1] + 9 + 8 - 2 * 94 = newx[1] - 171 = newx[1] + zf(1);
newy[2] = newx[2] + newx[1] + 9 - 2*newy[1] = newx[2] + newx[1] + zf(2) - 2*newy[1];
So it should be pretty clear now how zf works.
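A quick way to verify this is to filter a signal in two chunks and in one go, and compare (using Example 2's coefficients; the split point and second chunk are arbitrary):
x = 1:9;  newx = 10:15;            % one signal split into two chunks
b = [1 1 1];  a = [1 2];
[y1, zf] = filter(b, a, x);        % first chunk, keep the final conditions
y2 = filter(b, a, newx, zf);       % second chunk, reuse them as initial conditions
yAll = filter(b, a, [x newx]);     % the whole signal in one go
isequal([y1 y2], yAll)             % ans = 1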
The values in zf contain the internal state of the IIR filter. There are various ways these filters are implemented in practice, but in all of them there are some delay elements, which pass some values on to the next iteration. See for example this section of the Wikipedia entry about digital filters. In 'direct form 1', some delay elements hold the last few inputs and some other delay elements hold the last few outputs. In 'direct form 2', the delay elements contain intermediate results. Independent of the exact implementation, these memory locations must be restored so as not to cause any glitches in the output when processing the data in chunks.
When processing data in chunks, you should use the function filter like this:
filter_state = [];   % start with an empty state (filter treats [] as all zeros)
for i = 1:num_chunks
    input_chunk = get_chunk(i);   % get_chunk/save_chunk stand in for your own I/O
    [output_chunk, filter_state] = filter(b, a, input_chunk, filter_state);
    save_chunk(i, output_chunk)
end
I'm trying to solve the following problem:
I have a kernel made of 0's and 1's,
e.g. a cross-like kernel
kernel =
0 1 0
1 1 1
0 1 0
and I need to apply it to a given matrix like
D =
16 2 3 13
5 11 10 8
9 7 6 12
4 14 15 1
for simplicity, let's assume we start from element D(2,2), which is 11, to avoid padding (which I could handle with padarray).
I should superimpose the kernel and extract only the elements where kernel==1, i.e.
[2,5,11,10,7], then apply a custom filter on them, like a median or an average, and replace the central element with the result.
Then I would like to pass through all the other elements (neglecting edge elements for simplicity) and do the same.
Now I'm using tempS = ordfilt2(Z, order, kernel, 'symmetric');
which performs exactly that operation with a median filter. But I would like to use a different criterion (e.g. the average or some weird operation).
Use blockproc. This also handles border effects automatically (see the documentation). For example, to compute the median of the values masked by the kernel:
mask = logical(kernel);
R = blockproc(D, [1 1], @(d) median(d.data(mask)), ...
    'BorderSize', [1 1], 'TrimBorder', false);
The first [1 1] indicates the block size (the step). The second [1 1] indicates how many border elements to include around the central one.
With your example D, the result is
R =
2 3 3 3
9 7 8 10
5 9 10 6
4 7 6 1
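To apply a different criterion, just swap the function handle; for example, the mean of the masked values:
R = blockproc(D, [1 1], @(d) mean(d.data(mask)), ...
    'BorderSize', [1 1], 'TrimBorder', false);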
This should do what you want:
D = rand(10,20);
kernel = [0,1,0;1,1,1;0,1,0];
[dy,dx] = find(kernel==1);
% offsets relative to the kernel center; in general these should be calculated from the kernel size
dy = dy - 2;
dx = dx - 2;
% the loop's start and stop should also be calculated from the kernel size
result = zeros(size(D));
for y = 2:(size(D,1)-1)
    for x = 2:(size(D,2)-1)
        elements = D(sub2ind(size(D), y+dy, x+dx));   % values under the kernel's 1s
        result(y,x) = weirdOperation(elements);       % apply the custom criterion
    end
end
Nevertheless, this will perform very poorly in terms of speed. You should consider using built-in functions instead: conv2 or filter2 for linear filtering, and ordfilt2 for order-statistic functionality.
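For instance, an average over the cross-shaped neighbourhood is a linear operation, so it reduces to a single conv2 call (a sketch; note conv2 zero-pads the borders, unlike the loop above, which skips them):
avgKernel = kernel / sum(kernel(:));     % weights sum to 1, so this is an average
result = conv2(D, avgKernel, 'same');    % cross-shaped averaging in one call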
I am interested in calculating a function on a bunch of permutations of parameter values. I want to keep it generic to N dimensions, but let me write it out in 3 dimensions to start with. Generating the permutations is easy enough with meshgrid, but I can't figure out how to reshape the resulting array back into multiple dimensions. Here is a starting point:
%These are the 3 variations of parameters, with some values.
params1 = [100, 200, 300];%Picking these so it is easy to correlate to the function
params2 = [10, 20];
params3 = [1, 2];
%This generates parameter_values as the Cartesian product of the parameters.
[vec1, vec2, vec3] = meshgrid(params1, params2, params3);
parameter_values = [vec1(:) vec2(:) vec3(:)];
%Calculates functions on the set of parameters.
%Would have a fancier function, of course, this just makes it easy to see the results.
raw_vals = parameter_values(:,1) + parameter_values(:,2) + parameter_values(:,3);
%Rearrange into a multiarray to access by parameter indices.
f_vals = reshape(raw_vals, [length(params1), length(params2), length(params3)]) %WRONG?
%THE FOLLOWING ASSERTIONS FAIL, BUT SHOULD PASS WITH THESE PARAMETERS AND THIS FUNCTION.
assert(f_vals(2,1,1) == 211)
assert(f_vals(3,2,2) == 322)
You want ndgrid instead of meshgrid in this case.
meshgrid's syntax is [X,Y] = meshgrid(xgv,ygv), which causes Y(:) to vary fastest rather than X(:). See Gridded Data Representation for more details. In other words, you are getting:
>> [vec1, vec2, vec3] = meshgrid(params1, params2, params3)
vec1(:,:,1) =
100 200 300
100 200 300
vec1(:,:,2) =
100 200 300
100 200 300
vec2(:,:,1) =
10 10 10
20 20 20
vec2(:,:,2) =
10 10 10
20 20 20
...
But you want to be getting:
>> [vec1, vec2, vec3] = ndgrid(params1, params2, params3)
vec1(:,:,1) =
100 100
200 200
300 300
vec1(:,:,2) =
100 100
200 200
300 300
vec2(:,:,1) =
10 20
10 20
10 20
vec2(:,:,2) =
10 20
10 20
10 20
...
If you switch to ndgrid, then you get f_vals(2,1,1) == 211 as intended.
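Only the grid-generation line needs to change:
[vec1, vec2, vec3] = ndgrid(params1, params2, params3);  % was: meshgrid(params1, params2, params3)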
Generalizing to N-dimensions could be done like this:
params = {[100, 200, 300],[10, 20],[1, 2]};
vecs = cell(numel(params),1);
[vecs{:}] = ndgrid(params{:});
parameter_values = reshape(cat(numel(vecs)+1,vecs{:}),[],numel(vecs));
raw_vals = sum(parameter_values,2);
f_vals = reshape(raw_vals,cellfun(@numel,params))
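As a sanity check, the generalized version reproduces the values the question expects:
assert(f_vals(2,1,1) == 211)   % 200 + 10 + 1
assert(f_vals(3,2,2) == 322)   % 300 + 20 + 2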