How to accumulate (average) data based on multiple criteria - matlab

I have a set of data where I have recorded values in sets of 3 readings (so as to be able to obtain a general idea of the SEM). I have them recorded in a list that looks as follows, which I am trying to collapse into averages of each set of 3 points:
I want to collapse essentially each 3 rows into one row where the average data value is given for that set. In essence, it would look as follows:
This is something I know how to do basically in Excel (i.e. using a Pivot table) but I am not sure how to do the same in MATLAB. I have tried using accumarray but struggle with knowing how to incorporate multiple conditions essentially. I would need to create a subs array where its number corresponds to each unique set of 3 data points. By brute force, I could create an array such as:
subs = [1 1 1; 2 2 2; 3 3 3; 4 4 4; ...]'
using some looping and use that as my subs array, but it isn't tied to the data itself, so it would break on any hiccups (i.e. more than 3 data points per set, missing data, etc.). I know there must be some way to get this sort of Pivot-table-esque grouping, but I need some help to get it off the ground. Thanks.
Here is the input data in text form:
Subject Flow On/Off Values
1 10 1 2.20
1 10 1 2.50
1 10 1 2.60
1 20 1 5.50
1 20 1 6.10
1 20 1 5.90
1 30 1 10.10
1 30 1 10.50
1 30 1 10.50
1 10 0 1.90
1 10 0 2.20
1 10 0 2.30
1 20 0 5.20
1 20 0 5.80
1 20 0 5.60
1 30 0 9.80
1 30 0 10.20
1 30 0 10.20
2 10 1 5.70
2 10 1 6.00
2 10 1 6.10
2 20 1 9.00
2 20 1 9.60
2 20 1 9.40
2 30 1 13.60
2 30 1 14.00
2 30 1 14.00
2 10 0 5.40
2 10 0 5.70
2 10 0 5.80
2 20 0 8.70
2 20 0 9.30
2 20 0 9.10
2 30 0 13.30
2 30 0 13.70
2 30 0 13.70

You can use unique and accumarray like so to maintain the order of your rows of data:
[newData, ~, subs] = unique(data(:, 1:3), 'rows', 'stable');
newData(:, 4) = accumarray(subs, data(:, 4), [], @mean);
newData =
1.0000 10.0000 1.0000 2.4333
1.0000 20.0000 1.0000 5.8333
1.0000 30.0000 1.0000 10.3667
1.0000 10.0000 0 2.1333
1.0000 20.0000 0 5.5333
1.0000 30.0000 0 10.0667
2.0000 10.0000 1.0000 5.9333
2.0000 20.0000 1.0000 9.3333
2.0000 30.0000 1.0000 13.8667
2.0000 10.0000 0 5.6333
2.0000 20.0000 0 9.0333
2.0000 30.0000 0 13.5667

I assume that
You want to average based on unique values of the first three columns (not on groups of three rows, although the two criteria coincide in your example);
Order is determined by column 1, then 3, then 2.
Then, denoting your data as x,
[~, ~, subs] = unique(x(:, [1 3 2]), 'rows', 'sorted');
result = accumarray(subs, x(:,end), [], @mean);
gives
result =
2.1333
5.5333
10.0667
2.4333
5.8333
10.3667
5.6333
9.0333
13.5667
5.9333
9.3333
13.8667
As you see, I am using the third output of unique with the 'rows' and 'sorted' options. This creates the subs grouping vector based on first three columns of your data in the desired order. Then, passing that to accumarray computes the means.
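Since the question mentions the SEM, here is a minimal sketch of how the same subs vector could give both the mean and the standard error per group. It uses only the first six rows of the question's data, and the names keys, groupMean, groupSem and summary are illustrative, not from the answers above.

```matlab
% Toy data: columns are Subject, Flow, On/Off, Value (first rows of the question's data).
x = [1 10 1 2.20; 1 10 1 2.50; 1 10 1 2.60; ...
     1 20 1 5.50; 1 20 1 6.10; 1 20 1 5.90];

% subs(k) is the index of the unique key row that data row k belongs to.
[keys, ~, subs] = unique(x(:, 1:3), 'rows');

groupMean = accumarray(subs, x(:, 4), [], @mean);
% SEM = sample standard deviation divided by sqrt(group size).
groupSem  = accumarray(subs, x(:, 4), [], @(v) std(v) / sqrt(numel(v)));

% One row per group: the three key columns, then mean and SEM.
summary = [keys, groupMean, groupSem];
```

Because the grouping comes from the key columns rather than from row positions, a group with two or four readings would be handled the same way.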

accumarray is indeed the way to go. First, you'll need to assign an index to each set of values with unique:
[unique_subjects, ~, ind_subjects] = unique(vect_subjects);
[unique_flows, ~, ind_flows] = unique(vect_flows);
[unique_on_off, ~, ind_on_off] = unique(vect_on_off);
So basically, you now have ind_subjects, ind_flows and ind_on_off, whose values lie in [1..2], [1..3] and [1..2] respectively.
Now you can compute the mean values in a [3x2x2] array (in your example):
mean_values = accumarray([ind_flows, ind_on_off, ind_subjects], vect_values, [], @mean);
mean_values = mean_values(:);
Note: the dimension order is chosen to match the row order of your example.
Then you can construct the summary :
[ind1, ind2, ind3] = ndgrid(1:numel(unique_flows), 1:numel(unique_on_off), 1:numel(unique_subjects));
flows_summary = unique_flows(ind1(:));
on_off_summary = unique_on_off(ind2(:));
subjects_summary = unique_subjects(ind3(:));
Note: this also works with non-numeric values.
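One caveat with the grid approach is that accumarray fills combinations that never occur with 0 by default. The sketch below is a reduced two-key version of the idea (subjects and flows only, with made-up values) that passes NaN as the fill value and then drops the empty combinations; the variable sz and the toy numbers are assumptions for illustration.

```matlab
vect_subjects = [1; 1; 1; 2; 2];
vect_flows    = [10; 10; 20; 10; 10];
vect_values   = [2.0; 3.0; 5.0; 6.0; 8.0];   % subject 2 has no flow-20 reading

[unique_subjects, ~, ind_subjects] = unique(vect_subjects);
[unique_flows, ~, ind_flows]       = unique(vect_flows);

sz = [numel(unique_flows), numel(unique_subjects)];
% The fifth argument is the fill value for cells that receive no data.
mean_values = accumarray([ind_flows, ind_subjects], vect_values, sz, @mean, NaN);

% Expand the grid indices back into key columns, in the same order as mean_values(:).
[ind1, ind2] = ndgrid(1:sz(1), 1:sz(2));
summary = [unique_subjects(ind2(:)), unique_flows(ind1(:)), mean_values(:)];
summary = summary(~isnan(summary(:, 3)), :);   % drop combinations with no data
```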

You should also try checking out the findgroups and splitapply reference pages. The easiest way to use them here is probably to place your data in a table:
>> T = array2table(data, 'VariableNames', { 'Subject', 'Flow', 'On_Off', 'Values'});
>> [gid,Tgrp] = findgroups(T(:,1:3));
>> Tgrp.MeanValue = splitapply(@mean, T(:,4), gid)
Tgrp =
12×4 table
Subject Flow On_Off MeanValue
_______ ____ ______ _________
1 10 0 2.1333
1 10 1 2.4333
1 20 0 5.5333
1 20 1 5.8333
1 30 0 10.067
1 30 1 10.367
2 10 0 5.6333
2 10 1 5.9333
2 20 0 9.0333
2 20 1 9.3333
2 30 0 13.567
2 30 1 13.867
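In releases that have groupsummary (R2018a or later, as far as I know), the grouping and the averaging can be done in one call. This is a sketch on a six-row subset of the question's data, not part of the answer above.

```matlab
% Six rows of the question's data: Subject, Flow, On/Off, Value.
data = [1 10 1 2.2; 1 10 1 2.5; 1 10 1 2.6; ...
        1 10 0 1.9; 1 10 0 2.2; 1 10 0 2.3];
T = array2table(data, 'VariableNames', {'Subject', 'Flow', 'On_Off', 'Values'});

% Group by the three key variables and take the mean of Values.
% The result also carries a GroupCount variable with the group sizes.
G = groupsummary(T, {'Subject', 'Flow', 'On_Off'}, 'mean', 'Values');
```

The GroupCount column makes it easy to spot sets that do not have exactly three readings.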

Related

MATLAB: Filter struct based on column value

I'm new to MATLAB and too used to Python, and I'm having difficulty finding a way to filter a struct based on a condition, similar to how I can filter a pandas DataFrame in Python.
Matlab
a = arrayfun(@(x) x.value ==10, Data);
Data_10 = Data(a);
Error using arrayfun
Non-scalar in Uniform output, at index 1, output 1.
Set 'UniformOutput' to false.
How I would do it in Python:
Data_10 = Data[Data.value == 10]
Try this:
Data_10 = zeros(size(Data.value));
Data_10(Data.value==10) = 10;
This should write the value 10 into every position of Data_10 where Data.value is 10 and leave the rest as 0.
I am not sure if I fully understood your question. Here is my understanding:
You want to filter certain values of a matrix.
Let's imagine we have a matrix A filled with values. You want to filter out values smaller than lowthresh = 0 and greater than upthresh = 5.
A = [3 6 -2.4 1; 0 34 4.76 0.5; 84 3 2.32 4; 1 -1 2 3.99];
lowthresh = 0;
upthresh = 5;
A(A<lowthresh | A>upthresh) = NaN; % NaN is a good flag
Output:
A =
3.0000 NaN NaN 1.0000
0 NaN 4.7600 0.5000
NaN 3.0000 2.3200 4.0000
1.0000 NaN 2.0000 3.9900
Once the values are replaced, you can apply basic functions that ignore NaNs, for instance the average:
mean(A,'omitnan')
ans =
1.3333 3.0000 3.0267 2.3725
I hope this addresses your question. Notice that you can do this with any statement that returns a logical array (isnan(), ...), even if that array was not computed from the matrix you are indexing.
Let's say we have 2 matrices that have the same size but different numbers:
A =
1 1 0
1 1 0
0 0 0
B =
0 0 0
0 0 0
0 0 0
We can easily say:
B(A==1) = 2
B =
2 2 0
2 2 0
0 0 0
I hope it helped a bit,
cheers Pablo
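For the struct case in the question itself, logical indexing also works, but the details depend on how the struct is laid out. The sketch below covers two assumed layouts: a struct array with one scalar value per element, and a scalar struct whose fields are equal-length columns (the closer analogue of a pandas DataFrame, and the likely cause of the arrayfun error). The field names name and id and all values are hypothetical.

```matlab
% Case 1: struct array, each element holds a scalar 'value'.
Data = struct('name', {'a', 'b', 'c', 'd'}, 'value', {10, 3, 10, 7});
% [Data.value] concatenates the field across elements into one row vector.
mask = [Data.value] == 10;
Data_10 = Data(mask);

% Case 2: scalar struct whose fields are equal-length column vectors.
S.value = [10; 3; 10; 7];
S.id    = [1; 2; 3; 4];
keep = S.value == 10;
% Apply the same row mask to every field; the result is a struct with the same fields.
S_10 = structfun(@(f) f(keep), S, 'UniformOutput', false);
```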

Matlab dlmread adds random zeros

I need to read data into an array in MATLAB. I am trying to use dlmread, but it adds random zeros. How can I define the row length?
My file:
1 65.058 5 0
2 80.661 46 0
3 102.083 197 1
4 80.529 111 5
5 88.331 160 6
My line:
X = dlmread(Data, ' ', 0, 0);
Output:
1.0000 65.0580 5.0000
0 0 0
2.0000 80.6610 46.0000
0 0 0
3.0000 102.0830 197.0000
1.0000 0 0
4.0000 80.5290 111.0000
5.0000 0 0
There are two consecutive spaces in the first line of your file. This causes dlmread to add an additional column. I can't recreate your output (my version is R2015b), but I'm suspecting this is the culprit. You don't need to (and can't) define the number of rows or columns with dlmread; it's supposed to figure it out for itself by design. This shouldn't be a problem when your input data matches the expected format.
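If the file cannot be cleaned up, one way to sidestep the delimiter issue is to read the numbers with fscanf, which skips any run of whitespace. This is a sketch that assumes exactly four numeric columns per row and writes its own scratch file (with a deliberate double space) so that it runs on its own; the file name is a throwaway.

```matlab
fname = [tempname, '.txt'];          % hypothetical scratch file
fid = fopen(fname, 'w');
% Note the double space before the last value on the first line.
fprintf(fid, '1 65.058 5  0\n2 80.661 46 0\n3 102.083 197 1\n');
fclose(fid);

fid = fopen(fname, 'r');
% '%f' skips any amount of whitespace; fscanf fills column-wise, hence the transpose.
X = fscanf(fid, '%f', [4 Inf]).';
fclose(fid);
delete(fname);
```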

Normalization of inputs of a feedforward Neural network

Let's say I have an m-by-n matrix of different features of a time series signal (column 1 represents linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different time series but correlated signal, etc). How should I normalize these inputs? The inputs fall into different categories, so they have different ranges: one ranges from 0 to 1, another from -5 to 50, and so on.
Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?
Note: I usually use mapminmax function from MATLAB for the normalization.
You should normalise each vector/column of your matrix individually; they represent different data types and shouldn't be mixed up together.
You could for example transpose your matrix to have your 3 different data types in the rows instead of in the columns of your matrix and still use mapminmax:
A = [0 0.1 -5; 0.2 0.3 50; 0.8 0.8 10; 0.7 0.9 20];
A =
0 0.1000 -5.0000
0.2000 0.3000 50.0000
0.8000 0.8000 10.0000
0.7000 0.9000 20.0000
B = mapminmax(A')
B =
-1.0000 -0.5000 1.0000 0.7500
-1.0000 -0.5000 0.7500 1.0000
-1.0000 1.0000 -0.4545 -0.0909
You should normalize each feature independently.
Quoting the question: "column 1 represents linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different time series but correlated signal, etc"
I can't say for sure about your particular problem, but generally, you should normalize each feature independently. So normalize column 1, then column 2 etc.
Quoting the question: "Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?"
I'm not sure what you mean here. What is an input? If by that you mean an instance (a row of your matrix), then no, you should not normalize rows individually, but columns.
I don't know how you would do this in Matlab, but I took your question more as a theoretical one than an implementation one.
If you want a range of [0,1] for every column, with each column normalized independently, you can use mapminmax like so (assuming A is the 2D input array) -
out = mapminmax(A.',0,1).'
You can also use bsxfun for the same output, like so -
Aoffsetted = bsxfun(@minus,A,min(A,[],1))
out = bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
Sample run -
>> A
A =
3 7 4 2 7
1 3 4 5 7
1 9 7 5 3
8 1 8 6 7
>> mapminmax(A.',0,1).'
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1
>> Aoffsetted = bsxfun(@minus,A,min(A,[],1));
>> bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1
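On R2016b and later, implicit expansion should make the bsxfun calls unnecessary. The sketch below reuses the sample matrix from the run above and adds a guard for constant columns, which is an addition of mine rather than part of the answer; colMin and colRange are illustrative names.

```matlab
A = [3 7 4 2 7; 1 3 4 5 7; 1 9 7 5 3; 8 1 8 6 7];

colMin   = min(A, [], 1);              % per-column minimum (1-by-n row vector)
colRange = max(A, [], 1) - colMin;     % per-column range
colRange(colRange == 0) = 1;           % guard constant columns against division by zero

% Implicit expansion applies the row vectors to every row of A.
out = (A - colMin) ./ colRange;
```

Keeping colMin and colRange around also lets you apply the same scaling to new samples later, instead of rescaling them by their own extremes.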

Matlab Conditional probability from dataset

I have a Matrix M of 500x5 and I need to calculate conditional probability. I have discretised my data and then I have this code that currently only works with 3 variables rather than 5 but that's fine for now.
The code below already works out the number of times I get A=1, B=1 and C=1, the number of times we get A=2, B=1, C=1 etc.
data = M;
npatients=size(data,1)
asum=zeros(4,2,2)
prob=zeros(4,2,2)
for patient=1:npatients,
h=data(patient,1)
i=data(patient,2)
j=data(patient,3)
asum(h,i,j)=asum(h,i,j)+1
end
for h=1:4,
for i=1:2,
for j=1:2,
prob(h,i,j)=asum(h,i,j)/npatients
end
end
end
So I need code that sums over C to get the number of times we get A=1 and B=1, in order to find:
Prob(C=1 given A=1 and B=1) = P(A=1,B=1, C=1)/P( A=1, B=1).
This is the rule strength of the first rule. I need to find out how to loop over A, B and C to get the rest, and how to actually get this to work in MATLAB. I don't know if it's of any use, but I have code that puts each column into its own variable:
dest = M(:,1); gen = M(:,2); age = M(:,3); year = M(:,4); dur = M(:,5);
So say dest is the consequent and gen and age are the antecedents: how would I do this?
Below is the data of the first 10 patients as an example:
destination gender age
2 2 2
2 2 2
2 2 2
2 2 2
2 2 2
2 1 1
3 2 2
2 2 2
3 2 1
3 2 1
Any help is appreciated and badly needed.
Since your code didn't work by copy & paste, I changed it a little bit.
It's better to define a function that calculates the probability for given data:
function p = prob(data)
n = size(data,1);
uniquedata = unique(data);
p = zeros(length(uniquedata),2);
p(:,2) = uniquedata;
for i = 1 : size(uniquedata,1)
p(i,1) = sum(data == uniquedata(i)) / n;
end
end
Now in another script,
data =[3 2 91;
3 2 86;
3 2 90;
3 2 85;
3 2 86;
3 1 77;
4 2 88;
3 2 90;
4 2 79;
4 2 77;
4 1 65;
3 1 60];
pdest = prob(data(:,1));
pgend = prob(data(:,2));
page = prob(data(:,3));
This will give,
page =
0.0833 60.0000
0.0833 65.0000
0.1667 77.0000
0.0833 79.0000
0.0833 85.0000
0.1667 86.0000
0.0833 88.0000
0.1667 90.0000
0.0833 91.0000
pgend =
0.2500 1.0000
0.7500 2.0000
pdest =
0.6667 3.0000
0.3333 4.0000
That will give the probabilities you've already calculated.
Note that the second column of the output of prob holds the values and the first column the probabilities.
When you want to calculate probabilities for dest = 3 & gend = 2, you should create a new data set and call prob on it. For the new data set use:
mapd2g3 = data(:,1) == 3 & data(:,2) == 2;
datad2g3 = data(mapd2g3,:)
3 2 91
3 2 86
3 2 90
3 2 85
3 2 86
3 2 90
paged2g3 = prob(datad2g3(:,3))
0.1667 85.0000
0.3333 86.0000
0.3333 90.0000
0.1667 91.0000
This is prob(age | dest = 3 & gend = 2).
You could even write a function to create the data sets.
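For the discretised data in the question, the whole conditional table can also be obtained without loops. The sketch below uses the ten example patients from the question and assumes the level counts 4, 2 and 2 from its zeros(4,2,2) call; counts, pairCounts and condProb are illustrative names.

```matlab
% First 10 patients from the question: destination, gender, age (discretised).
data = [2 2 2; 2 2 2; 2 2 2; 2 2 2; 2 2 2; ...
        2 1 1; 3 2 2; 2 2 2; 3 2 1; 3 2 1];

% counts(d, g, a) = number of patients with destination d, gender g, age a.
counts = accumarray(data, 1, [4 2 2]);

% Sum over destination (dimension 1) to count each (gender, age) pair.
pairCounts = sum(counts, 1);

% condProb(d, g, a) = P(dest = d | gender = g, age = a);
% it is NaN where the (gender, age) pair never occurs (0/0).
condProb = bsxfun(@rdivide, counts, pairCounts);
```

Dividing counts by the number of patients instead would give the joint probabilities that the loops in the question compute.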

Combine matrices using loop and condition in matlab

I have the following two matrices
c=[1 0 1.05
1 3 2.05
1 6 2.52
1 9 0.88
2 0 2.58
2 3 0.53
2 6 3.69
2 9 0.18
3 0 3.22
3 3 1.88
3 6 3.98]
f=[1 6 3.9
1 9 9.1
1 12 9
2 0 0.3
2 3 0.9
2 6 1.2
2 9 2.5
3 0 2.7]
And the final matrix should be
n=[1 6 2.52 3.9
1 9 0.88 9.1
2 0 2.58 0.3
2 3 0.53 0.9
2 6 3.69 1.2
2 9 0.18 2.5
3 0 3.22 2.7]
The code I used gives as a result only the last row of the previous matrix [n].
for j=1
for i=1:rs1
for k=1
for l=1:rs2
if f(i,j)==c(l,k) && f(i,j+1)==c(l,k+1)
n=[f(i,j),f(i,j+1),f(i,j+2), c(l,k+2)];
end
end
end
end
end
Can anyone help me on this?
Is there something more simple?
Thanks in advance
You should learn to use set operations and avoid loops wherever possible. Here intersect could be extremely useful:
[u, idx_c, idx_f] = intersect(c(:, 1:2) , f(:, 1:2), 'rows');
n = [c(idx_c, :), f(idx_f, end)];
Explanation: by specifying the 'rows' flag, intersect finds the common rows in c and f, and their indices are given in idx_c and idx_f respectively. Use vector subscripting to extract matrix n.
Example
Let's use the example from your question:
c = [1 0 1.05;
1 3 2.05
1 6 2.52
1 9 0.88
2 0 2.58
2 3 0.53
2 6 3.69
2 9 0.18
3 0 3.22
3 3 1.88
3 6 3.98];
f = [1 6 3.9
1 9 9.1
1 12 9
2 0 0.3
2 3 0.9
2 6 1.2
2 9 2.5
3 0 2.7];
[u, idx_c, idx_f] = intersect(c(:, 1:2) , f(:, 1:2), 'rows');
n = [c(idx_c, :), f(idx_f, end)];
This should yield the desired result:
n =
1.0000 6.0000 2.5200 3.9000
1.0000 9.0000 0.8800 9.1000
2.0000 0 2.5800 0.3000
2.0000 3.0000 0.5300 0.9000
2.0000 6.0000 3.6900 1.2000
2.0000 9.0000 0.1800 2.5000
3.0000 0 3.2200 2.7000
According to this answer on Mathworks support you can use join from the statistics toolbox, specifically in your case, an inner join.
Unfortunately I don't have access to my computer with matlab on it, but give it a try and let us know how/if it works.
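With tables, an inner join is available through innerjoin, which to my knowledge is part of base MATLAB. Here is a sketch on a subset of the question's matrices; the variable names id, t, cval and fval are made up for the example.

```matlab
% Subsets of the question's matrices c and f.
c = [1 0 1.05; 1 6 2.52; 2 3 0.53; 3 3 1.88];
f = [1 6 3.9; 1 12 9; 2 3 0.9; 3 0 2.7];

Tc = array2table(c, 'VariableNames', {'id', 't', 'cval'});
Tf = array2table(f, 'VariableNames', {'id', 't', 'fval'});

% Keep only the rows whose (id, t) pair appears in both tables.
Tn = innerjoin(Tc, Tf, 'Keys', {'id', 't'});
n = table2array(Tn);
```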
You can reduce the number of loops by comparing the first and second columns at once, then using the "all" function so that a row is only added if both match. The following snippet replicates the "n" array you provided.
n = [];
for r1 = 1:size(c, 1)
for r2 = 1:size(f,1)
if all(c(r1, [1 2]) == f(r2, [1 2]))
n(end+1, 1:4) = [c(r1,:) f(r2,3)];
end
end
end
If you insist on doing this in a loop, you need to write each match into its own row of n using a row counter, or concatenate to n on each iteration (this can be very slow for big matrices). For example, writing:
m = 0; % row counter for n
for j=1
for i=1:rs1
for k=1
for l=1:rs2
if f(i,j)==c(l,k) && f(i,j+1)==c(l,k+1)
m=m+1; % advance only when a match is found
n(m,:)=[f(i,j),f(i,j+1),f(i,j+2), c(l,k+2)];
end
end
end
end
end
will save the four numbers of each match into the m-th row of n, where m counts the matches found so far.
However, just be aware that this can be done also without a nested loop and an if condition, in a vectorized way. For example, instead of the condition if f(i,j)==c(l,k)... you can use ismember etc...
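A minimal sketch of the ismember route, again on a subset of the question's matrices, and assuming the key pairs in f are unique:

```matlab
% Subsets of the question's matrices c and f.
c = [1 0 1.05; 1 6 2.52; 2 3 0.53; 3 3 1.88];
f = [1 6 3.9; 1 12 9; 2 3 0.9; 3 0 2.7];

% tf(k) is true if the key (columns 1:2) of row k of c occurs in f;
% loc(k) is the matching row of f (0 where there is no match).
[tf, loc] = ismember(c(:, 1:2), f(:, 1:2), 'rows');

% Matching rows of c, followed by the third column of the matching rows of f.
n = [c(tf, :), f(loc(tf), 3)];
```

The rows of n come out in the order they appear in c.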
How about doing it without any for loops at all (besides those in native code)?
mf = size(f,1);
mc = size(c,1);
a = repmat(c(:,1:2),1,mf);
b = repmat(reshape((f(:,1:2))',1,[]),mc,1);
match = a == b;
match = match(:, 1 : 2 : 2*mf) & match(:, 2 : 2 : 2*mf);
crows = nonzeros(diag(1:mc) * match);
frows = nonzeros(match * diag(1:mf));
n = [c(crows,:),f(frows,3)]