Matlab Conditional probability from dataset - matlab

I have a Matrix M of 500x5 and I need to calculate conditional probability. I have discretised my data and then I have this code that currently only works with 3 variables rather than 5 but that's fine for now.
The code below already works out the number of times I get A=1, B=1 and C=1, the number of times we get A=2, B=1, C=1 etc.
data = M;
npatients=size(data,1)
asum=zeros(4,2,2)
prob=zeros(4,2,2)
for patient=1:npatients,
h=data(patient,1)
i=data(patient,2)
j=data(patient,3)
asum(h,i,j)=asum(h,i,j)+1
end
for h=1:4,
for i=1:2,
for j=1:2,
prob(h,i,j)=asum(h,i,j)/npatients
end
end
end
So I need code to sum over C to get the number of times we get A=1 and B=1 (adding over all C) to find:
Prob(C=1 given A=1 and B=1) = P(A=1,B=1, C=1)/P( A=1, B=1).
This is the rule strength of the first rule. I need to find out how to loop over A, B and C to get the rest, and how to actually get this to work in Matlab. I don't know if it's of any use, but I have code to put each column into its own variable:
dest = M(:,1); gen = M(:,2); age = M(:,3); year = M(:,4); dur = M(:,5);
So say dest is the consequent and gen and age are the antecedents, how would I do this?
Below is the data of the first 10 patients as an example:
destination gender age
2 2 2
2 2 2
2 2 2
2 2 2
2 2 2
2 1 1
3 2 2
2 2 2
3 2 1
3 2 1
Any help is appreciated and badly needed.

Since your code didn't work by copy & paste, I changed it a little bit.
It's better if you define a function that calculates the probability for given data:
function p = prob(data)
n = size(data,1);
uniquedata = unique(data);
p = zeros(length(uniquedata),2);
p(:,2) = uniquedata;
for i = 1 : size(uniquedata,1)
p(i,1) = sum(data == uniquedata(i)) / n;
end
end
Now in another script,
data =[3 2 91;
3 2 86;
3 2 90;
3 2 85;
3 2 86;
3 1 77;
4 2 88;
3 2 90;
4 2 79;
4 2 77;
4 1 65;
3 1 60];
pdest = prob(data(:,1));
pgend = prob(data(:,2));
page = prob(data(:,3));
This will give,
page =
0.0833 60.0000
0.0833 65.0000
0.1667 77.0000
0.0833 79.0000
0.0833 85.0000
0.1667 86.0000
0.0833 88.0000
0.1667 90.0000
0.0833 91.0000
pgend =
0.2500 1.0000
0.7500 2.0000
pdest =
0.6667 3.0000
0.3333 4.0000
That will give the probabilities you've already calculated.
Note that the second column of the result holds the values and the first column the probabilities.
When you want to calculate probabilities for dest = 3 & gend = 2, you should create a new data set and call prob on it. For the new data set use:
mapd2g3 = data(:,1) == 3 & data(:,2) == 2;
datad2g3 = data(mapd2g3,:)
3 2 91
3 2 86
3 2 90
3 2 85
3 2 86
3 2 90
paged2g3 = prob(datad2g3(:,3))
0.1667 85.0000
0.3333 86.0000
0.3333 90.0000
0.1667 91.0000
This is the prob(age|dest = 3 & gend = 2) .
You could even write a function to create the data sets.
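If you would rather get every conditional probability at once from the count array in the question, a minimal sketch is below. It assumes the columns are already discretised to positive integers (at most 4, 2 and 2 levels, as in the question's asum), and uses the 10 sample patients. The names counts, countsAB, pCgivenAB and pAgivenBC are just illustrative.

```matlab
% First 10 patients from the question; columns are A, B, C
% (destination, gender, age in the sample table)
data = [2 2 2; 2 2 2; 2 2 2; 2 2 2; 2 2 2; ...
        2 1 1; 3 2 2; 2 2 2; 3 2 1; 3 2 1];

% counts(a,b,c) = number of rows with A=a, B=b, C=c
% (the same array the question builds in asum with a loop)
counts = accumarray(data, 1, [4 2 2]);

% Number of rows with A=a and B=b, adding over all C (4x2 matrix)
countsAB = sum(counts, 3);

% P(C=c | A=a, B=b) for every combination.
% Combinations (a,b) that never occur give 0/0 = NaN.
pCgivenAB = bsxfun(@rdivide, counts, countsAB);

% If the first column (dest) is the consequent instead:
% P(A=a | B=b, C=c), dividing by the counts summed over A
pAgivenBC = bsxfun(@rdivide, counts, sum(counts, 1));
```

For example, pCgivenAB(3,2,1) is Prob(C=1 given A=3 and B=2), and pAgivenBC(2,2,2) is Prob(dest=2 given gen=2 and age=2). With only 10 rows most entries are NaN because the antecedent never occurs.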

Related

How to accumulate (average) data based on multiple criteria

I have a set of data where I have recorded values in sets of 3 readings (so as to be able to obtain a general idea of the SEM). I have them recorded in a list that looks as follows, which I am trying to collapse into averages of each set of 3 points:
I want to collapse essentially each 3 rows into one row where the average data value is given for that set. In essence, it would look as follows:
This is something I know how to do basically in Excel (i.e. using a Pivot table) but I am not sure how to do the same in MATLAB. I have tried using accumarray but struggle with knowing how to incorporate multiple conditions essentially. I would need to create a subs array where its number corresponds to each unique set of 3 data points. By brute force, I could create an array such as:
subs = [1 1 1; 2 2 2; 3 3 3; 4 4 4; ...]'
using some looping and have that as my subs array, but it isn't tied to the data itself, and there may be strange hiccups throughout (i.e. more than 3 data points per set, or missing data, etc.). I know there must be some way to have this sort of Pivot-table-esque grouping for something like this, but need some help to get it off the ground. Thanks.
Here is the input data in text form:
Subject Flow On/Off Values
1 10 1 2.20
1 10 1 2.50
1 10 1 2.60
1 20 1 5.50
1 20 1 6.10
1 20 1 5.90
1 30 1 10.10
1 30 1 10.50
1 30 1 10.50
1 10 0 1.90
1 10 0 2.20
1 10 0 2.30
1 20 0 5.20
1 20 0 5.80
1 20 0 5.60
1 30 0 9.80
1 30 0 10.20
1 30 0 10.20
2 10 1 5.70
2 10 1 6.00
2 10 1 6.10
2 20 1 9.00
2 20 1 9.60
2 20 1 9.40
2 30 1 13.60
2 30 1 14.00
2 30 1 14.00
2 10 0 5.40
2 10 0 5.70
2 10 0 5.80
2 20 0 8.70
2 20 0 9.30
2 20 0 9.10
2 30 0 13.30
2 30 0 13.70
2 30 0 13.70
You can use unique and accumarray like so to maintain the order of your rows of data:
[newData, ~, subs] = unique(data(:, 1:3), 'rows', 'stable');
newData(:, 4) = accumarray(subs, data(:, 4), [], @mean);
newData =
1.0000 10.0000 1.0000 2.4333
1.0000 20.0000 1.0000 5.8333
1.0000 30.0000 1.0000 10.3667
1.0000 10.0000 0 2.1333
1.0000 20.0000 0 5.5333
1.0000 30.0000 0 10.0667
2.0000 10.0000 1.0000 5.9333
2.0000 20.0000 1.0000 9.3333
2.0000 30.0000 1.0000 13.8667
2.0000 10.0000 0 5.6333
2.0000 20.0000 0 9.0333
2.0000 30.0000 0 13.5667
I assume that
You want to average based on unique values of the first three columns (not on groups of three rows, although the two criteria coincide in your example);
Order is determined by column 1, then 3, then 2.
Then, denoting your data as x,
[~, ~, subs] = unique(x(:, [1 3 2]), 'rows', 'sorted');
result = accumarray(subs, x(:,end), [], @mean);
gives
result =
2.1333
5.5333
10.0667
2.4333
5.8333
10.3667
5.6333
9.0333
13.5667
5.9333
9.3333
13.8667
As you see, I am using the third output of unique with the 'rows' and 'sorted' options. This creates the subs grouping vector based on first three columns of your data in the desired order. Then, passing that to accumarray computes the means.
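To see what the grouping vector looks like, here is a small sketch on made-up data (not the data from the question) where the groups have different sizes. It is only meant to illustrate that subs is tied to the key columns rather than to fixed blocks of three rows.

```matlab
% Hypothetical data: columns are Subject, Flow, On/Off, Value.
% Groups deliberately have 2, 2 and 1 rows.
x = [1 10 1 2.2;
     1 10 1 2.6;
     1 10 0 1.9;
     1 10 0 2.3;
     2 10 1 5.7];

% One group per unique (Subject, On/Off, Flow) combination, sorted
[groups, ~, subs] = unique(x(:, [1 3 2]), 'rows', 'sorted');

% How many rows fell into each group
counts = accumarray(subs, 1);

% Mean of the value column within each group
means = accumarray(subs, x(:, end), [], @mean);
```

Here subs comes out as [2; 2; 1; 1; 3], so the first two rows are averaged together, the next two together, and the last row forms a group on its own.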
accumarray is indeed the way to go. First, you'll need to assign an index to each set of values with unique:
[unique_subjects, ~, ind_subjects] = unique(vect_subjects);
[unique_flows, ~, ind_flows] = unique(vect_flows);
[unique_on_off, ~, ind_on_off] = unique(vect_on_off);
So basically, you now have ind_subjects, ind_flows and ind_on_off, whose values lie in [1..2], [1..3] and [1..2] respectively.
Now you can compute the mean values in a [3x2x2] array (in your example):
mean_values = accumarray([ind_flows, ind_on_off, ind_subjects], vect_values, [], @mean);
mean_values = mean_values(:);
Note: the order is set according to your example.
Then you can construct the summary :
[ind1, ind2, ind3] = ndgrid(1:numel(unique_flows), 1:numel(unique_on_off), 1:numel(unique_subjects));
flows_summary = unique_flows(ind1(:));
on_off_summary = unique_on_off(ind2(:));
subjects_summary = unique_subjects(ind3(:));
Note: this also works with non-numeric values.
You should also try checking out the findgroups and splitapply reference pages. The easiest way to use them here is probably to place your data in a table:
>> T = array2table(data, 'VariableNames', { 'Subject', 'Flow', 'On_Off', 'Values'});
>> [gid,Tgrp] = findgroups(T(:,1:3));
>> Tgrp.MeanValue = splitapply(@mean, T(:,4), gid)
Tgrp =
12×4 table
Subject Flow On_Off MeanValue
_______ ____ ______ _________
1 10 0 2.1333
1 10 1 2.4333
1 20 0 5.5333
1 20 1 5.8333
1 30 0 10.067
1 30 1 10.367
2 10 0 5.6333
2 10 1 5.9333
2 20 0 9.0333
2 20 1 9.3333
2 30 0 13.567
2 30 1 13.867

Piecewise average over n elements in Matlab

I want to piecewise-average a vector in Matlab. Vector x looks like this:
x = 1:15;
Respectively:
x = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
I want to find the mean value over n = 5 elements; therefore, the result-vector y should look like:
y = [1 1.5 2 2.5 3 4 5 6 7 8 9 10 11 12 13]
The code for generating the vector y should somehow work like this:
y = [
mean ([1])
mean ([1,2])
mean ([1,2,3])
mean ([1,2,3,4])
mean ([1,2,3,4,5])
mean ([2,3,4,5,6])
mean ([3,4,5,6,7])
mean ([4,5,6,7,8])
mean ([5,6,7,8,9])
mean ([6,7,8,9,10])
mean ([7,8,9,10,11])
mean ([8,9,10,11,12])
mean ([9,10,11,12,13])
mean ([10,11,12,13,14])
mean ([11,12,13,14,15])
]
For n < 5 elements, the program should average over n elements. For example, if there are only 3 elements available, the code should average the first 3 elements. For n > 5 elements, the program should average over the last 5 elements.
Any help is appreciated!
For such sliding summing or averaging operations, a very efficient vectorized approach would be with 1D convolution conv, like so -
n = 5
sums = conv(x,ones(1,n))
out = sums(1:numel(x))./[1:n n*ones(1,numel(x)-n)]
Try this:
x = 1:15;
for n = 1:length(x)
if n <= 5
y(n) = mean(x(1:n))
else
y(n) = mean(x(n-4:n))
end
end
Below is a kind of a brute force method.
for j=1:length(x)
A=j-4;
if A<1
A=1;
end;
y(j)=mean(x(A:j))
end;
or in a more compact form:
for j=1:length(x)
y(j)=mean(x(max(j-4,1):j));
end;
Here is an alternative method
Build a matrix whose rows hold the numbers to be averaged for each element. Here bsxfun() adds the offsets -(n-1):0 to each element of the column vector [1:15].', so each row contains the current number and the four numbers before it. Negative entries (positions before the start of the vector) are then set to 0 so that they can be excluded from the average:
n =
5
A = bsxfun(@plus ,[1:15].',-(n -1):0)
A(A<0) = 0
A =
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 12
9 10 11 12 13
10 11 12 13 14
11 12 13 14 15
and then divide the sum of each row by number of non-zero elements in each row
>> sum(A,2)./sum(~ismember(A,0),2)
ans =
1.0000
1.5000
2.0000
2.5000
3.0000
4.0000
5.0000
6.0000
7.0000
8.0000
9.0000
10.0000
11.0000
12.0000
13.0000
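On newer MATLAB releases the same trailing average can probably be had in one call. The sketch below assumes movmean is available (it is not in older versions) and relies on its default behaviour of shrinking the window at the start of the vector. The conv version from above is repeated only as a cross-check.

```matlab
x = 1:15;
n = 5;

% Trailing window: the current element plus the n-1 elements before it.
% At the start of the vector the window shrinks to the elements available.
y = movmean(x, [n-1 0]);

% Cross-check against the conv-based answer
sums  = conv(x, ones(1, n));
yConv = sums(1:numel(x)) ./ [1:n, n*ones(1, numel(x)-n)];

maxDiff = max(abs(y - yConv));   % should be zero up to rounding
```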

Combine matrices using loop and condition in matlab

I have the following two matrices
c=[1 0 1.05
1 3 2.05
1 6 2.52
1 9 0.88
2 0 2.58
2 3 0.53
2 6 3.69
2 9 0.18
3 0 3.22
3 3 1.88
3 6 3.98]
f=[1 6 3.9
1 9 9.1
1 12 9
2 0 0.3
2 3 0.9
2 6 1.2
2 9 2.5
3 0 2.7]
And the final matrix should be
n=[1 6 2.52 3.9
1 9 0.88 9.1
2 0 2.58 0.3
2 3 0.53 0.9
2 6 3.69 1.2
2 9 0.18 2.5
3 0 3.22 2.7]
The code I used gives as a result only the last row of the previous matrix [n].
for j=1
for i=1:rs1
for k=1
for l=1:rs2
if f(i,j)==c(l,k) && f(i,j+1)==c(l,k+1)
n=[f(i,j),f(i,j+1),f(i,j+2), c(l,k+2)];
end
end
end
end
end
Can anyone help me on this?
Is there something more simple?
Thanks in advance
You should learn to use set operations and avoid loops wherever possible. Here intersect could be extremely useful:
[u, idx_c, idx_f] = intersect(c(:, 1:2) , f(:, 1:2), 'rows');
n = [c(idx_c, :), f(idx_f, end)];
Explanation: by specifying the 'rows' flag, intersect finds the common rows in c and f, and their indices are given in idx_c and idx_f respectively. Use vector subscripting to extract matrix n.
Example
Let's use the example from your question:
c = [1 0 1.05;
1 3 2.05
1 6 2.52
1 9 0.88
2 0 2.58
2 3 0.53
2 6 3.69
2 9 0.18
3 0 3.22
3 3 1.88
3 6 3.98];
f = [1 6 3.9
1 9 9.1
1 12 9
2 0 0.3
2 3 0.9
2 6 1.2
2 9 2.5
3 0 2.7];
[u, idx_c, idx_f] = intersect(c(:, 1:2) , f(:, 1:2), 'rows');
n = [c(idx_c, :), f(idx_f, end)];
This should yield the desired result:
n =
1.0000 6.0000 2.5200 3.9000
1.0000 9.0000 0.8800 9.1000
2.0000 0 2.5800 0.3000
2.0000 3.0000 0.5300 0.9000
2.0000 6.0000 3.6900 1.2000
2.0000 9.0000 0.1800 2.5000
3.0000 0 3.2200 2.7000
According to this answer on Mathworks support you can use join from the statistics toolbox, specifically in your case, an inner join.
Unfortunately I don't have access to my computer with matlab on it, but give it a try and let us know how/if it works.
You can reduce the number of loops by comparing both the first and second columns at once, then using the "all" function to only combine the rows if both values match. The following snippet replicates the "n" array you had provided.
n = [];
for r1 = 1:size(c, 1)
for r2 = 1:size(f,1)
if all(c(r1, [1 2]) == f(r2, [1 2]))
n(end+1, 1:4) = [c(r1,:) f(r2,3)];
end
end
end
If you insist on doing this in a loop, you need to give n the proper dimension according
to the loop counter you are using, or concatenate it to itself on each iteration (this can be very slow for big matrices). For example, writing:
rs1=size(f,1); % number of rows in f
rs2=size(c,1); % number of rows in c
m=0; % number of matches found so far
for j=1
for i=1:rs1
for k=1
for l=1:rs2
if f(i,j)==c(l,k) && f(i,j+1)==c(l,k+1)
m=m+1; % count only matching pairs, so n gets no empty rows
n(m,:)=[f(i,j),f(i,j+1),c(l,k+2),f(i,j+2)]; % keys, value from c, value from f
end
end
end
end
end
will save the four numbers of the m-th match into the m-th row of n.
However, just be aware that this can be done also without a nested loop and an if condition, in a vectorized way. For example, instead of the condition if f(i,j)==c(l,k)... you can use ismember etc...
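A rough sketch of the ismember route, using the matrices from the question. It assumes each (column 1, column 2) pair appears at most once in c, since ismember only reports the first matching row.

```matlab
c = [1 0 1.05; 1 3 2.05; 1 6 2.52; 1 9 0.88; ...
     2 0 2.58; 2 3 0.53; 2 6 3.69; 2 9 0.18; ...
     3 0 3.22; 3 3 1.88; 3 6 3.98];
f = [1 6 3.9; 1 9 9.1; 1 12 9; 2 0 0.3; ...
     2 3 0.9; 2 6 1.2; 2 9 2.5; 3 0 2.7];

% tf(i) is true when the key pair in row i of f also occurs in c;
% loc(i) is the row of c where it occurs (0 if there is no match)
[tf, loc] = ismember(f(:, 1:2), c(:, 1:2), 'rows');

% Matching rows of c, followed by the third column of the matching rows of f
n = [c(loc(tf), :), f(tf, 3)];
```

The rows of n come out in the order of f, which for this data is the same as the desired matrix in the question.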
How about without any for loops at all (besides in native code)
mf = size(f,1);
mc = size(c,1);
a = repmat(c(:,1:2),1,mf);
b = repmat(reshape((f(:,1:2))',1,[]),mc,1);
match = a == b;
match = match(:, 1 : 2 : 2*mf) & match(:, 2 : 2 : 2*mf);
crows = nonzeros(diag(1:mc) * match);
frows = nonzeros(match * diag(1:mf));
n = [c(crows,:),f(frows,3)]

I want to calculate the mean of two rows in matlab only if the first elements of the two rows are equal [duplicate]

I am sorry for repeating myself but I am stuck at a point.
I have a 1028 by 18 matrix with some entire rows having NaN values. So I need to compare the first two elements of adjacent rows and calculate the average only if the first two elements are equal.
D is the 1028 by 18 matrix
[m,n]=size(D);
for i=1:m-1
if D(i,1)==D(i+1,1)
D=reshape(D, 2, m/2*n);
D=(D(i,:)+D(i+1,:))/2;
D=reshape(D, m/2, n);
else
end
end
I do not have MATLAB with me, but the logic will be something like this:
for row = 1:size(m,1)-1
if m(row,1) == m(row+1,1)
mean1 = mean(m(row,:)); % mean of the current row
mean2 = mean(m(row+1,:)); % mean of the next row
avg = mean([mean1, mean2]); % average of the two row means
end
end
% note: this is untested, it just gives you the idea
You can have a logical index of the valid rows according to your definition using all(~diff(D(:,1:2)), 2), i.e. the elements where both first and second column, row-wise difference is zero.
Then you can use this index to return either the integer row indices or the matching rows of a matrix holding the mean of every adjacent row pair.
index_row = 1:1:size(D, 1); % linear row index
index_valid = all(~diff(D(:,1:2)), 2); % valid rows (logical)
mean_matrix = (D(1:end-1,:) + D(2:end,:))/2; % matrix of all means
% matrix of valid mean rows only
mean_matrix_valid = mean_matrix(index_valid,:); % logical index
% linear index of valid rows, i.e. the pairs indexed (i, i+1)
index_row_valid = index_row(index_valid); % valid rows (int)
For example with
D = [1 2 3 4 5; 1 1 1 1 1; 1 2 4 4 4; 1 2 3 3 3; 2 2 2 2 2; 2 2 3 3 3];
>> D =
1 2 3 4 5
1 1 1 1 1
1 2 4 4 4
1 2 3 3 3
2 2 2 2 2
2 2 3 3 3
you will get, using the above
>> index_valid =
0
0
1
0
1
>> index_row_valid =
3 5
>> mean_matrix_valid =
1.0000 2.0000 3.5000 3.5000 3.5000
2.0000 2.0000 2.5000 2.5000 2.5000
which are the means of rows (3,4) and (5,6) respectively.

Cartesian product in MATLAB

Here is the simplified version of the problem I have. Suppose I have a vector
p=[1 5 10]
and another one
q=[.75 .85 .95]
And I want to come up with the following matrix:
res=[1, .75;
1, .85;
1, .95;
5, .75;
5, .85;
5, .95;
10, .75;
10, .85;
10, .95]
This is also known as the Cartesian Product.
How can I do that?
Here's one way:
[X,Y] = meshgrid(p,q);
result = [X(:) Y(:)];
The output is:
result =
1.0000 0.7500
1.0000 0.8500
1.0000 0.9500
5.0000 0.7500
5.0000 0.8500
5.0000 0.9500
10.0000 0.7500
10.0000 0.8500
10.0000 0.9500
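If you prefer not to build the two grids, a possible alternative is to repeat the elements directly. This is only a sketch for the two-vector case, and it assumes repelem is available in your MATLAB version.

```matlab
p = [1 5 10];
q = [.75 .85 .95];

% Repeat each element of p once per element of q (1 1 1 5 5 5 ...),
% and tile the whole of q once per element of p (.75 .85 .95 .75 ...)
res = [repelem(p(:), numel(q)), repmat(q(:), numel(p), 1)];
```

The rows come out in the same order as in the question, with p varying slowest.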
A similar approach to the one described by @nibot can be found in the MATLAB Central File Exchange.
It generalizes the solution to any number of input sets. This would be a simplified version of the code:
function C = cartesian(varargin)
args = varargin;
n = nargin;
[F{1:n}] = ndgrid(args{:});
for i=n:-1:1
G(:,i) = F{i}(:);
end
C = unique(G , 'rows');
end
For instance:
cartesian(['c','d','e'],[1,2],[50,70])
ans =
99 1 50
99 1 70
99 2 50
99 2 70
100 1 50
100 1 70
100 2 50
100 2 70
101 1 50
101 1 70
101 2 50
101 2 70
Here's a function, cartesian_product, that can handle any type of input, including string arrays, and returns a table with column names that match the names of the input variables. Inputs that are not variables are given names like var1, var2, etc.
function tbl = cartesian_product(varargin)
names = arrayfun(@inputname, 1:nargin, 'UniformOutput', false);
for i = 1:nargin
if isempty(names{i})
names{i} = ['var' num2str(i)];
end
end
rev_args = flip(varargin);
[A{1:nargin}] = ndgrid(rev_args{:});
B = cellfun(@(x) x(:), A, 'UniformOutput', false);
C = flip(B);
tbl = table(C{:}, 'VariableNames', names);
end
>> x = ["a" "b"];
>> y = 1:3;
>> z = 4:5;
>> cartesian_product(x, y, z)
ans =
12×3 table
x y z
___ _ _
"a" 1 4
"a" 1 5
"a" 2 4
"a" 2 5
"a" 3 4
"a" 3 5
"b" 1 4
"b" 1 5
"b" 2 4
"b" 2 5
"b" 3 4
"b" 3 5
>> cartesian_product(1:2, 3:4)
ans =
4×2 table
var1 var2
____ ____
1 3
1 4
2 3
2 4