MATLAB Cell Array - Average two values if another column matches - matlab

I have a cell array in which some of the entries have two data points. I want to average the two data points if the data were collected on the same day.
The first column of cell array 'site' is the date. The fourth column is the data concentration. I want to average the fourth column if the data comes from the same day.
For example, if my cell array looks like this:
01/01/2011 36-061-0069 1 10.4
01/01/2011 36-061-0069 2 10.1
01/04/2011 36-061-0069 1 7.9
01/05/2011 36-061-0069 1 13
I want to average the fourth column (10.4 and 10.1) into one row and leave everything else the same.
Help? Would an if elseif loop work? I'm not sure how to approach this issue, especially since cell arrays work a little differently than matrices.

You can do it succinctly without a loop, using a combination of unique, diff and accumarray.
Define data:
data = {'01/01/2011' '36-061-0069' '1' '10.4';
'01/01/2011' '36-061-0069' '2' '10.1';
'01/04/2011' '36-061-0069' '1' '7.9';
'01/05/2011' '36-061-0069' '1' '13'};
Then:
dates = datenum(data(:,1),'mm/dd/yyyy'); % mm/dd/yyyy format. Change the format string for other formats
[dates_sort ind_sort] = sort(dates);
[~, ii, jj] = unique(dates_sort);
n = diff([0; ii]);
result = accumarray(jj,vertcat(str2double(data(ind_sort,4))))./n;
gives the desired result:
result =
10.2500
7.9000
13.0000
If needed, you can get the non-repeated, sorted dates with data(ind_sort(ii),1).
Explanation of the code: the dates are first converted to numbers and sorted. The unique dates and repeated dates are then extracted. Finally, data in repeated rows are summed and divided by the number of repetitions to obtain the averages.
Compatibility issues for Matlab 2013a onwards:
The function unique has changed in Matlab 2013a. For that version onwards, add 'legacy' flag to unique, i.e. replace the line [~, ii, jj] = unique(dates_sort) by
[~, ii, jj] = unique(dates_sort,'legacy')
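As a hedged alternative that sidesteps the 'legacy' question, the third output of unique can be used directly as a group label and the averaging left to accumarray with @mean. This is only a sketch on the sample data from the answer; the variable names uniqueDates, group and dailyMean are made up for illustration.

```matlab
% Same sample data as in the answer above
data = {'01/01/2011' '36-061-0069' '1' '10.4';
        '01/01/2011' '36-061-0069' '2' '10.1';
        '01/04/2011' '36-061-0069' '1' '7.9';
        '01/05/2011' '36-061-0069' '1' '13'};

dates  = datenum(data(:,1), 'mm/dd/yyyy');  % dates as serial date numbers
values = str2double(data(:,4));             % concentrations as doubles

% uniqueDates is sorted; group(k) is the position of row k's date in uniqueDates
[uniqueDates, ~, group] = unique(dates);

% Average all values that share the same group label
dailyMean = accumarray(group, values, [], @mean);
```

Since only the third output of unique is used, the sketch does not depend on whether unique reports the first or the last occurrence of each date.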

It sounds like you want to do:
total = 0;
for it = 1:size(cellArray,1)
total = total + cellArray{it}(4); % or .NameOfColumn if it is a cell containing a struct
end
total / size(cellArray,1) % mean of the accumulated values

Related

Using Matlab to randomly split an Excel Sheet

I have an Excel sheet containing 1838 records and I need to RANDOMLY split these records into 3 Excel Sheets. I am trying to use Matlab but I am quite new to it and I have just managed the following code:
[xlsn, xlst, raw] = xlsread('data.xls');
numrows = 1838;
randindex = ceil(3*rand(numrows, 1));
raw1 = raw(:,randindex==1);
raw2 = raw(:,randindex==2);
raw3 = raw(:,randindex==3);
Your general procedure will be to read the spreadsheet into some matlab variables, operate on those matrices such that you end up with three thirds and then write each third back out.
So you've got the read covered with xlsread, that results in the two matrices xlsnum and xlstxt. I would suggest using the syntax
[~, ~, raw] = xlsread('data.xls');
In the xlsread help file (you can access this by typing doc xlsread into the command window) it says that the three output arguments hold the numeric cells, the text cells and the whole lot. This is because a matlab matrix can only hold one type of value and a spreadsheet will usually be expected to have text or numbers. The raw value will hold all of the values but in a 'cell array' instead, a different kind of matlab data type.
So then you will have a cell array called raw. From here you want to do four things:
work out how many rows you have (I assume each record is a row) by using the size function and specifying the appropriate dimension (again check the help file to see how to do this)
create an index of random numbers between 1 and 3 inclusive, which you can use as a mask
randindex = ceil(3*rand(numrows, 1));
apply the mask to your cell array to extract the records matching each index
raw1 = raw(randindex==1,:); % rows are records; do the same for the other two index values
write each cell back to a file
xlswrite('output1.xls', raw1);
You will probably have to fettle the arguments to get it to work the way you want but be sure to check the doc functionname page to get the syntax just right. Your main concern will be to get the indexing correct - matlab indexes row-first whereas spreadsheets tend to be column-first (e.g. cell A2 is column A and row 2, but matlab matrix element M(1,2) is the first row and the second column of matrix M, i.e. cell B1).
UPDATE: splitting the file evenly is surprisingly more trouble: because we're using random numbers for the index, it's not guaranteed to split evenly. So instead we can generate a vector of random floats and then pick out the lowest 33% of them to make index 1, the highest 33% to make index 3, and let the rest be 2.
randvec = rand(numrows, 1); % float between 0 and 1
pct33 = prctile(randvec,100/3); % value of 33rd percentile
pct67 = prctile(randvec,200/3); % value of 67th percentile
randindex = ones(numrows,1);
randindex(randvec>pct33) = 2;
randindex(randvec>pct67) = 3;
It probably still won't be absolutely even - 1838 isn't a multiple of 3. You can see how many members each group has this way
numel(find(randindex==1))
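If the Statistics Toolbox (which provides prctile) is not available, one possible alternative is to shuffle the row numbers with randperm and deal the labels 1, 2, 3 out in turn. This is a sketch, not the method from the answer above; order and counts are illustrative names.

```matlab
numrows = 1838;                      % number of records, as in the question

order = randperm(numrows);           % random ordering of the row numbers
labels = mod((0:numrows-1).', 3) + 1; % 1,2,3,1,2,3,... as a column vector

randindex = zeros(numrows, 1);
randindex(order) = labels;           % each row gets one label, in shuffled order

counts = accumarray(randindex, 1);   % group sizes
```

With 1838 rows the group sizes come out as 613, 613 and 612, which is as even as the row count allows. The mask is then applied to rows, e.g. raw(randindex==1,:).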

Index of Max Value from by grouping using varfun

I have a table with Ids and Dates. I would like to retrieve the index of the max date for each Id.
My initial approach is so:
varfun(@max, table, 'GroupingVariables', 'Id', 'InputVariables','Date');
This obviously gives me the date rather than the index.
I noted that the max function will return both the maxvalue and maxindex when specified:
[max_val, max_idx] = max(values);
How can I define an anonymous function using max to retrieve max_idx? I would then use it in the var_fun to get my result.
I'd prefer not to declare a cover function (as opposed to an anonymous function) over max() as:
1. I'm working in a script and would rather not create another function file
2. I'm unwilling to change my current script to a function
Thanks a million guys,
I'm assuming your Ids are positive integers and your Dates are numbers.
If you wanted the maximum Date for each Id, it would be a perfect case for accumarray with the max function. In the following I'll use f to denote a generic function passed to accumarray.
The fact that you want the index of the maximum makes it a little trickier (and more interesting!). The problem is that the Dates corresponding to a given Id are passed to f without any reference to their original index. Therefore, an f based on max can't help. But you can make the indices "pass through" accumarray as imaginary parts of the Dates.
So: if you want just one maximizing index (even if there are several) for each Id:
result = accumarray(t.Id,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) imag(x(find(real(x)==max(real(x)),1)))); %// function f
Note that the function f here maximizes the real part and then extracts the imaginary part, which contains the original index.
Or, if you want all maximizing indices for each Id:
result = accumarray(t.Id,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) {imag(x(find(real(x)==max(real(x)))))}); %// function f
If your Ids are strings: transform them into numeric labels using the third output of unique, and then proceed as above:
[~, ~, NumId] = unique(t.Id);
and then either
result = accumarray(NumId,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) imag(x(find(real(x)==max(real(x)),1)))); % function f
or
result = accumarray(NumId,... %// col vector of Id's
t.Date+1j*(1:size(t,1)).', ... %'// col vector of Dates (real) and indices (imag)
[], ... %// default size for output
@(x) {imag(x(find(real(x)==max(real(x)))))}); %// function f
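To make the "indices as imaginary parts" idea concrete, here is a small sketch on made-up data (the vectors Id and Date are hypothetical, standing in for t.Id and t.Date). It assumes, as the answer does, that accumarray accepts complex values. Since accumarray does not promise the order in which values reach the function, the tie is broken with min rather than by position.

```matlab
% Hypothetical sample data: Ids are positive integers, Dates are numeric
Id   = [1; 2; 1; 2; 1];
Date = [5; 7; 9; 3; 9];

% Real part carries the date, imaginary part carries the original row number
packed = Date + 1j*(1:numel(Id)).';

% For each Id: keep the entries whose real part is maximal,
% then return the smallest row number among them
maxIdx = accumarray(Id, packed, [], ...
    @(x) min(imag(x(real(x) == max(real(x))))));
```

For Id 1 the maximum date 9 occurs in rows 3 and 5, so row 3 is returned; for Id 2 the maximum date 7 is in row 2.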
I don't think varfun is the right approach here, as
varfun(func,A) applies the function func separately to each variable of
the table A.
This would only make sense if you wanted to apply it to multiple columns.
Simple approach:
Simply go with the loop approach: First find the different IDs using unique, then for each ID find the indices of the maximum dates. (This assumes your dates are in a numerical format which can be compared directly using max.)
I did rename your variable table to t, as otherwise we would be overwriting the built-in function table.
uniqueIds = unique(t.Id);
for i = 1:numel(uniqueIds)
equalsCurrentId = t.Id==uniqueIds(i);
globalIdxs = find(equalsCurrentId);
[~, localIdxsOfMax] = max(t.Date(equalsCurrentId));
maxIdxs{i} = globalIdxs(localIdxsOfMax);
end
As you mentioned your Ids are actually strings instead of numbers, you will have to change the line: equalsCurrentId = t.Id==uniqueIds(i); to
equalsCurrentId = strcmp(t.Id, uniqueIds{i});
Approach using accumarray:
If you prefer a more compact style, you could use this solution inspired by Luis Mendo's answer, which should work for both numerical and string Ids:
[uniqueIds, ~, global2Unique] = unique(t.Id);
maxDateIdxsOfIdxSubset = @(I) {I(nth_output(2, @max, t.Date(I)))};
maxIdxs = accumarray(global2Unique, 1:length(t.Id), [], maxDateIdxsOfIdxSubset);
This uses nth_output of gnovice's great answer.
Usage:
Both above solutions will yield: A vector uniqueIds with a corresponding cell-array maxIdxs, in a way that maxIdxs{i} are the indices of the maximum dates of uniqueIds(i).
If you only want a single index, even though there are multiple entries where the maximum is attained, use the following to strip away the unwanted data:
maxIdxs = cellfun(@(X) X(1), maxIdxs);
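Note that nth_output is not a built-in function; it has to be defined separately. A possible variant of the accumarray approach that avoids it compares each date with the group maximum directly. The sketch below uses hypothetical plain vectors Id and Date in place of the table columns.

```matlab
% Hypothetical sample data with string Ids
Id   = {'a'; 'b'; 'a'; 'b'; 'a'};
Date = [5; 7; 9; 3; 9];

[uniqueIds, ~, group] = unique(Id);  % group(k) = position of Id{k} in uniqueIds
rows = (1:numel(Date)).';            % original row numbers, passed through accumarray

% For each group, keep every row number whose date equals the group maximum
maxIdxs = accumarray(group, rows, [], @(I) {I(Date(I) == max(Date(I)))});

% Sort inside each cell so the result does not depend on accumarray's ordering
maxIdxs = cellfun(@sort, maxIdxs, 'UniformOutput', false);
```

Here maxIdxs{1} holds rows 3 and 5 (both carry the maximum date for 'a') and maxIdxs{2} holds row 2.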

Count the number of times something showed up in the top 36 rows

I have a cell of size 1x7 where each cell inside of that is 365x5xN in which each N is a different location (siteID). It is already sorted according to column 5 (the columns are Lat, Lon, siteID, date, and data).
(The data can be found here: https://www.dropbox.com/sh/li3hh1nvt11vok5/4YGfwStQlo. Variable in question is PM25)
I want to go through the entire 1x7 cell and, looking at only the top 36 rows (basically, the top 10 percentile), count the number of times each date shows up. In other words, I want to know on which days the data value fell in the top 10 percentile.
Does anyone know how I can do this? I can't get my mind around how to approach this issue --> counting across all these cells and spitting out a quantity for each day of the year
Assuming you have a sorted cell array, you may use this -
%%// Get all the dates for all the rows in sorted cell array
all_dates = [];
for k1=1:size(sorted_cell,2)
all_dates = [all_dates reshape(cell2mat(sorted_cell{1,k1}(:,4,:)),1,[])];
end
all_unique_dates = unique(all_dates);
all_out = [num2cell(all_unique_dates)' num2cell(zeros(numel(all_unique_dates),1))];%%//'
%%// Get all the dates for the first 36 rows in sorted cell array
dates = [];
for k1=1:size(sorted_cell,2)
dates = [dates reshape(cell2mat(sorted_cell{1,k1}(1:36,4,:)),1,[])];
end
%%// Get unique dates and their counts
unique_dates = unique(dates);
count = histc(dates, unique_dates);
%%// As output create a cell array with the first column as dates
%%// and the second column as the counts
out = [num2cell(unique_dates)' num2cell(count)']
%%// Get all the dates and the corresponding counts.
%%// Thus many would still have counts as zeros.
all_out(ismember(all_unique_dates,unique_dates),:)=out;
Often when something looks tricky from the outside, it's easier to start from the inside instead. How can we get the top dates from a single array?
dates = unique(array(1:36,4));
Now, how to do that for each cell? A loop is always straightforward, but this is a pretty simple function, so let's use the one-liner:
datecell = cellfun(@(x) unique(x(1:36,4)), cellarray, 'UniformOutput', false);
Now we have just the dates we want, for each cell. If there's no need to keep them separated, let's just stick them all together into one big array:
dates = vertcat(datecell{:}); % the per-cell lists can differ in length
dates = unique(dates); % in case there are any duplicates
If you want to actually count each date as well (it's a little unclear), it might be a little too involved for an anonymous function, so we could either write our own function to pass to cellfun, or just cop out and stick it in a loop:
dates = {};
counts = {};
for ii = 1:length(cellarray)
[dates{ii}, ~, idx] = unique(cellarray{ii}(1:36,4));
counts{ii} = accumarray(idx, 1);
end
Now, those cell arrays may contain duplication, so we'll have to combine the counts where necessary in a similar way:
dates = vertcat(dates{:}); % stack the per-cell column vectors
counts = vertcat(counts{:});
[dates, ~, idx] = unique(dates);
counts = accumarray(idx, counts); % add the counts of duplicated dates together
Note that re-assigning different data to the same variable names like this isn't particularly good practice - I'm just feeling exceptionally lazy tonight, and coming up with good, descriptive names is hard ;)
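The counting steps can be tried out on a toy example. The sketch below assumes each cell holds a plain 2-D numeric matrix (not the 365x5xN arrays from the question) and uses topN = 2 in place of 36; cellarray, topN and topDates are illustrative names.

```matlab
% Hypothetical toy data: two sites, columns are Lat, Lon, siteID, date, data,
% each already sorted by column 5 in descending order
cellarray = {[0 0 1 3 9.1; 0 0 1 1 8.0; 0 0 1 2 7.5], ...
             [0 0 2 3 6.4; 0 0 2 2 6.0; 0 0 2 1 5.2]};
topN = 2;  % stands in for 36 in the real data

% Dates of the top rows of every cell, stacked into one column vector
topDates = cellfun(@(x) x(1:topN, 4), cellarray, 'UniformOutput', false);
topDates = vertcat(topDates{:});

% Count how often each date occurs among the top rows
[dates, ~, idx] = unique(topDates);
counts = accumarray(idx, 1);
```

Date 3 is in the top two rows at both sites, so it gets a count of 2, while dates 1 and 2 each get a count of 1.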

Matlab matching first column of a row as index and then averaging all columns in that row

I need help with taking the following data which is organized in a large matrix and averaging all of the values that have a matching ID (index) and outputting another matrix with just the ID and the averaged value that trail it.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be misled by the examples I posted: that first number is repeated for about 15 lines, then the ID changes, and so on for an entire set of different IDs; then they are repeated as a whole group again. Think first block of code = [1 2 3; 1 5 9; 2 5 7; 2 4 6], then the code repeats with different values for the columns except for the index. The main difference is the values trailing the ID, which I need to average out in MATLAB, outputting a clean matrix with only one of each ID, fully averaged over all occurrences of that ID.
Thanks for any help given.
A modification of this answer does the job, as follows:
[value_sort ind_sort] = sort(StarData(:,1));
[~, ii, jj] = unique(value_sort);
n = diff([0; ii]);
averages = NaN(length(n),size(StarData,2)); % preallocate
averages(:,1) = value_sort(ii); % ii indexes the sorted IDs, not the original rows
for col = 2:size(StarData,2)
averages(:,col) = accumarray(jj,StarData(ind_sort,col))./n;
end
The result is in variable averages. Its first column contains the values used as indices, and each subsequent column contains the average for that column according to the index value.
Compatibility issues for Matlab 2013a onwards:
The function unique has changed in Matlab 2013a. For that version onwards, add 'legacy' flag to unique, i.e. replace second line by
[~, ii, jj] = unique(value_sort,'legacy')
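A possible variant that needs neither the sort nor the 'legacy' flag uses the third output of unique as the group label and lets accumarray apply @mean. The sketch runs on the small block quoted in the question; ids and group are illustrative names.

```matlab
StarData = [1 2 3; 1 5 9; 2 5 7; 2 4 6];  % toy block from the question

[ids, ~, group] = unique(StarData(:,1));  % group(k) = position of row k's ID in ids

% First column: the unique IDs; remaining columns: per-ID averages
averages = [ids, zeros(numel(ids), size(StarData,2) - 1)];
for col = 2:size(StarData,2)
    averages(:,col) = accumarray(group, StarData(:,col), [], @mean);
end
```

For this input, averages is [1 3.5 6; 2 4.5 6.5].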

MATLAB - Index exceeds matrix dimensions

Hi, I have a problem with a matrix.
I have many .txt files with different numbers of rows but the same number of columns (1 column)
e.g. s1.txt = 1234 rows
s2.txt = 1200 rows
s3.txt = 1100 rows
I wanted to combine the three files. Since they have different numbers of rows, when I write them to a new file I get this error: Index exceeds matrix dimensions.
How can I solve this problem?
You can combine three matrices simply by stacking them: Assuming that s1, etc are the matrices you read in, you can make a new one like this:
snew = [s1; s2; s3];
You could also use the [] style stacking without creating the new matrix variable if you only need to do it once.
You have provided far too little information for an accurate diagnosis of your problem. Perhaps you have loaded the data from your files into variables in your workspace. Perhaps s1 has 1 column and 1234 rows, etc. Then you can concatenate the variables into one column vector like this:
totalVector = [s1; s2; s3];
and write it out to a file with a save() statement.
Does that help ?
Let me assume that this question is connected with your other question, and that you want to combine those matrices by columns, leaving empty values in columns with fewer data.
In this case this code should work:
BaseFile ='s';
n=3;
A = cell(1,n);
for k=1:n
A{k} = dlmread([BaseFile num2str(k) '.txt']);
end
% create cell array with maximum number of rows and n number of columns
B = cell(max(cellfun(@numel,A)),n);
% convert each matrix in A to cell array and store in B
for k=1:n
B(1:numel(A{k}),k) = num2cell(A{k});
end
% save the data
xlswrite('output.txt',B)
The code assumes you have one column in each file, otherwise it will not work.
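If a numeric matrix is preferred over a cell array, the same column-wise layout can be sketched with NaN as the filler for the shorter columns. The vectors below are hypothetical stand-ins for the contents of s1.txt, s2.txt and s3.txt, so the sketch runs without any files.

```matlab
% Hypothetical stand-ins for the data read from the three files
A = {(1:4).', (1:3).', (1:2).'};

maxRows = max(cellfun(@numel, A));  % length of the longest column
padded = NaN(maxRows, numel(A));    % NaN marks the missing entries

for k = 1:numel(A)
    padded(1:numel(A{k}), k) = A{k}; % copy each vector into its own column
end
```

Because every column is padded to the longest length, no index runs past the end of a shorter vector.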