Cells of different sizes - overcoming "index exceeds matrix dimensions" errors - MATLAB

I want to find the r^2 for each of the 3rd dimensions (the 3rd dimension is basically columns of data). However, when I try to index into each of the cells with a for loop (looping over the states and then over the sets of data), I run into "index exceeds matrix dimensions" errors, since some of the third dimensions are small while others are larger.
I tried to sort the cells first:
[dummy, Index] = sort(cellfun('size', data_O3_spring, 3), 'descend');
S = data_O3_spring(Index);
And then loop through and find the corrcoef (the second data set, data_PM25_spring, has the same structure as data_O3_spring, described below):
for k = 1:7 % number of states
    for j = 1:17 % largest number of sites
        r2_spring{k}(:,:,j) = power(corrcoef(S{k}(:,:,j), data_PM25_spring{k}(:,:,j), 'rows', 'pairwise'), 2);
    end
end
However, this gives me an "index exceeds matrix dimensions" error once j goes above 5 (the size of the smallest set of data).
About the format of my data: data_O3_spring is a <1x7> cell containing data for 7 states for the months in spring.
The 7 cells of data_O3_spring (one per state) hold the different sets of data I'm looking at, and have the following sizes:
<61x1x7 double>
<61x1x17 double>
<61x1x8 double>
<61x1x16 double>
<61x1x5 double>
<61x1x12 double>
<61x1x13 double>
61 is the number of days (rows). There's 1 column. And the third dimension size is the number of sets of data I'm looking at in that particular state (so it varies by state).
I tried using a while loop, but didn't manage to get it to work either.

I may be missing a detail, but it seems you can change your loop from:
for j=1:17,
to
for j = 1:size(S{k},3),
Each state has a different number of sites, and that's fine because you are storing the output in a cell array (r2_spring{k}(:,:,j)), which does not require the dimension indexed by j to be the same for every state.
Also, pairing S{k}(:,:,j) with data_PM25_spring{k}(:,:,j) is a problem, because you've reordered data_O3_spring into S but left data_PM25_spring in its original order, so state k of S no longer lines up with state k of data_PM25_spring. I'd suggest either reordering the PM2.5 data with the same index:
corrcoef(S{k}(:,:,j), data_PM25_spring{Index(k)}(:,:,j), 'rows', 'pairwise')
or skipping the sort entirely and keeping both data sets in their original order:
corrcoef(data_O3_spring{k}(:,:,j), data_PM25_spring{k}(:,:,j), 'rows', 'pairwise')
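Putting the two fixes together, here is a minimal sketch of the corrected loop, assuming you skip the sorting step and keep both data sets in their original state order:
nStates = numel(data_O3_spring);            % 7 states
r2_spring = cell(1, nStates);               % preallocate the output cell array
for k = 1:nStates
    nSites = size(data_O3_spring{k}, 3);    % number of sites varies by state
    for j = 1:nSites
        R = corrcoef(data_O3_spring{k}(:,:,j), data_PM25_spring{k}(:,:,j), ...
                     'rows', 'pairwise');
        r2_spring{k}(:,:,j) = R.^2;         % 2x2 r^2 matrix for this state/site
    end
end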

Related

MATLAB automatically assigns unwanted dimensions to a dynamically updated cell array

I want to generate a 3D cell array called timeData so that timeData(:,:,a) for some integer a is an nx1 matrix of data, and the number of rows n varies with the value of a in a 1:1 correspondence. To do this, I am generating a 2D array of data called data that is nx1. This assignment statement takes place within a for loop as follows:
% Before iterating, I define an array of indices where I want to store the
% data sets in timeData. This choice of storage location is for
% organizational purposes.
A = [2, 5, 9, 21, 34, 100]; % Notice they are in ascending order, but have
                            % gaps that have no predictability.
sizeA = size(A);
numIter = sizeA(2); % numIter is the number of data sets that I need to store
                    % in timeData
for m = 1:numIter
    % At this point, some code that is entirely irrelevant to my question
    % generates an nx1 array of data. One example of this data array is below.
    data = [1.1;2.3;5.5;4.4]; % This is one example of what data could be. Its
                              % number of rows, n, changes each iteration, as
                              % do its contents.
    B = size(data);
    timeData(1:B(1),1,A(m)) = num2cell(data);
end
This code does put all contents of data in the appropriate locations within timeData as I want. However, it also adds {0x0 double} rows to all 2D arrays of timeData(:,:,a) for any a whose corresponding number of rows n was not the largest number of rows. Thus, there are many of these 2D arrays that have 10 to a couple hundred 0-valued rows that I don't want. For values of a that did not have a corresponding data set, the content of timeData(:,:,a) is an nx1 array of {0x0 double}.
I need to iterate over the contents of timeData in subsequent code, and I need to be able to find the size of the data set that is in timeData(:,:,a) without somehow discounting all the {0x0 double}.
How can I modify my assignment statement to fix this?
Edit: Desired output of the above example is the following with n = 5. Let this data set be represented by a = 9.
timeData(:,:,9) = {[1.1]}
{[2.3]}
{[5.5]}
{[8.6]}
{[4.4]}
Now, consider the possibility that a previous or subsequent value of the A matrix had a data set with n = 7, and n = 7 is the largest data set (largest n value). timeData(:,:,9) outputs like so in my code:
timeData(:,:,9) = {[1.1]}
{[2.3]}
{[5.5]}
{[8.6]}
{[4.4]}
{[0x0 double]}
{[0x0 double]}
@Dev-iL, as I understand it, your answer gives me the ability to delete the cells that have {[0x0 double]} in them (this is what I mean by "discounting"). This is a good plan B, but is there a way to prevent the {[0x0 double]} cells from showing up in the first place?
Edit 2: Update to the above statement "your answer gives me the ability to delete the cells that have {[0x0 double]} in them (this is what I mean by 'discounting')". The cellfun(@isempty, ...) function makes the {[0x0 double]} cells go to {[0x0 cell]}; it does not remove them. In other words, size(timeData(:,:,9)) is the same before and after the command is performed. This is not what I want. I want size(timeData(:,:,9)) to be 5x1 no matter what n is for any other value of a.
Edit 3: I just realized that the most desired output would be the following:
timeData(:,:,9) = {[1.1;2.3;5.5;8.6;4.4]} % An n x 1 column matrix within
% the cell.
but I can work with this outcome or the outcome as described above.
Unfortunately, I don't understand the structure of your dataset, which is why I can't suggest a better assignment method. However, I'd like to point out an operation that can help you deal with your data after it's been created:
cellfun(@isempty,timeData);
What the above does is return a logical array the size of timeData, indicating which cells contain something "empty". Typically, an array of arbitrary datatype is considered "empty" when it has at least one dimension that is equal to 0.
How can you use it to your advantage?
%% Example 1: counting non-empty cells:
nData = sum(~cellfun(@isempty,timeData(:)));
%% Example 2: assigning empty cells in place of empty double arrays:
timeData(cellfun(@isempty,timeData)) = {{}};
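If the layout described in Edit 3 is the goal (one cell per value of a, holding the entire n-by-1 vector), a minimal sketch of an alternative assignment, assuming the same A and loop structure as in the question, would be:
A = [2, 5, 9, 21, 34, 100];             % storage indices, as in the question
timeData = cell(1, 1, max(A));          % preallocate along the third dimension
for m = 1:numel(A)
    data = [1.1; 2.3; 5.5; 8.6; 4.4];   % stand-in for the real n-by-1 data,
                                        % which changes every iteration
    timeData{1, 1, A(m)} = data;        % store the whole vector in one cell
end
% timeData(:,:,9) is now a 1x1 cell holding the full vector, and
% size(timeData{1,1,9}) is n-by-1 regardless of the other data sets' sizes.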

Using Matlab to randomly split an Excel Sheet

I have an Excel sheet containing 1838 records and I need to RANDOMLY split these records into 3 Excel sheets. I am trying to use MATLAB, but I am quite new to it and have only managed the following code so far:
[xlsn, xlst, raw] = xlsread('data.xls');
numrows = 1838;
randindex = ceil(3*rand(numrows, 1));
raw1 = raw(:,randindex==1);
raw2 = raw(:,randindex==2);
raw3 = raw(:,randindex==3);
Your general procedure will be to read the spreadsheet into some MATLAB variables, operate on those matrices so that you end up with three thirds, and then write each third back out to a file.
So you've got the read covered with xlsread, which gives you the two matrices xlsn and xlst plus the raw cell array. I would suggest using the syntax
[~, ~, raw] = xlsread('data.xls');
In the xlsread help file (you can access this by typing doc xlsread into the command window) it says that the three output arguments hold the numeric cells, the text cells and the whole lot. This is because a matlab matrix can only hold one type of value and a spreadsheet will usually be expected to have text or numbers. The raw value will hold all of the values but in a 'cell array' instead, a different kind of matlab data type.
So then you will have a cell array called raw. From here you want to do the following:
work out how many rows you have (I assume each record is a row) by using the size function and specifying the appropriate dimension (again check the help file to see how to do this)
create an index of random numbers between 1 and 3 inclusive, which you can use as a mask
randindex = ceil(3*rand(numrows, 1));
apply the mask to your cell array to extract the records matching each index
raw1 = raw(randindex==1, :); % records are rows, so index rows; do the same for the other two index values
write each cell back to a file
xlswrite('output1.xls', raw1);
You will probably have to fettle the arguments to get it to work the way you want, but be sure to check the doc functionname page to get the syntax just right. Your main concern will be to get the indexing correct: MATLAB indexes row-first (matrix element M(1,2) is the first row and second column, i.e. cell B1), whereas spreadsheet cell references are column-first (cell A2 is column A and row 2).
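Putting the steps together, a minimal end-to-end sketch (assuming each record is a row and the hypothetical output file names output1.xls, output2.xls and output3.xls) might look like this:
[~, ~, raw] = xlsread('data.xls');        % read everything into a cell array
numrows = size(raw, 1);                   % one record per row
randindex = ceil(3*rand(numrows, 1));     % random group label 1, 2 or 3 per record
xlswrite('output1.xls', raw(randindex==1, :));
xlswrite('output2.xls', raw(randindex==2, :));
xlswrite('output3.xls', raw(randindex==3, :));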
UPDATE: splitting the file evenly is a bit more trouble than you might expect: because we're using random numbers for the index, it's not guaranteed to split evenly. Instead, we can generate a vector of random floats and then pick out the lowest 33% of them to make index 1, the highest 33% to make index 3, and let the rest be 2.
randvec = rand(numrows, 1); % float between 0 and 1
pct33 = prctile(randvec,100/3); % value of 33rd percentile
pct67 = prctile(randvec,200/3); % value of 67th percentile
randindex = ones(numrows,1);
randindex(randvec>pct33) = 2;
randindex(randvec>pct67) = 3;
It probably still won't be absolutely even - 1838 isn't a multiple of 3. You can see how many members each group has this way
numel(find(randindex==1))

Finding the largest vector inside a matrix

I am trying to find the largest vector inside a matrix composed of vectors with MATLAB, but I am having some difficulties, so I would be very thankful if someone could help me. I have this:
The matrix paths (the output of a Dijkstra function) is a 1000x1000 matrix whose entries are vectors with 1 row and varying numbers of columns (when a vector has more than 10 columns, it is displayed as "1x11 double", "1x12 double", etc.). The matrix paths has this form:
1 2 3 ....
1 1 <1x20 double> <1x16 double>
2 <1x20 double> 2 [2,870,183,492,641,863,611,3]
3 <1x16 double> [3,611,863,641,492,183,870,2] 3
4 <1x25 double> <1x12 double> <1x14 double>
.
.
.
At first I thought of just finding the largest vector in the matrix by
B = max(length(paths))
However MATLAB returns B = 1000, a value which is possible but unlikely. When trying to find the position of the vector by using:
[row,column] = find(length(paths) == B)
MATLAB returns row = 1, column = 1, which is surely wrong... I think it may be a problem with how MATLAB stores the data. It is as if it doesn't consider the entries of the matrix as vectors, because when I enter:
length(paths(3,2))
it returns 1, but as I understand it should return 8. Also, when entering:
paths(3,2)
it returns [1x8 double], but I expect to see the whole vector. I don't know what to do; maybe a "for" loop? I really do not know whether MATLAB treats the entries of the matrix as vectors or as simple double values.
The cell with the largest vector can be found using cellfun and numel to get the number of elements in each numeric matrix stored in the cells of paths:
vecLens = cellfun(@numel,paths);
[maxLen,im] = max(vecLens(:));
[rowMax,colMax] = ind2sub(size(vecLens),im)
This gets a 1000x1000 numeric matrix vecLens containing the sizes, max gets the linear index of the largest element, and ind2sub translates that to row,column indexes.
A note on length: it gives you the size of the largest dimension. The size of paths is 1000x1000, so length(paths) is 1000. My advice is: don't ever use length. Use size, specifying the dimension you want.
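A quick illustration of the difference, using a small hypothetical matrix:
M = zeros(3, 7);
length(M)     % returns 7: the largest dimension, which is easy to misread
size(M, 1)    % returns 3: explicitly the number of rows
size(M, 2)    % returns 7: explicitly the number of columns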
If multiple vectors are the same length, you get the first one with the above approach. To get all of them (starting after the max command):
maxMask = vecLens==maxLen;
if nnz(maxMask)>1,
[rowMax,colMax] = find(maxMask);
else
[rowMax,colMax] = ind2sub(size(vecLens),im)
end
or just
[rowMax,colMax] = find(vecLens==maxLen);

Matlab matching first column of a row as index and then averaging all columns in that row

I need help taking the following data, which is organized in a large matrix, averaging all of the values that have a matching ID (index), and outputting another matrix with just the ID followed by the averaged values.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be misled by the examples I posted: the first number is repeated for about 15 lines, then the ID changes, and so on through an entire set of different IDs; then the IDs repeat as a whole group again, with different values in the columns except for the index (think first block of data = [1 2 3; 1 5 9; 2 5 7; 2 4 6], then the block repeats with new values). The main task is to average the values trailing each ID in MATLAB and output a clean matrix with only one row per ID, averaged over all occurrences of that ID.
Thanks for any help given.
A modification of this answer does the job, as follows:
[value_sort, ind_sort] = sort(StarData(:,1));
[~, ii, jj] = unique(value_sort);
n = diff([0; ii]);
averages = NaN(length(n), size(StarData,2)); % preallocate
averages(:,1) = value_sort(ii); % the unique ID values
for col = 2:size(StarData,2)
    averages(:,col) = accumarray(jj, StarData(ind_sort,col))./n;
end
The result is in variable averages. Its first column contains the values used as indices, and each subsequent column contains the average for that column according to the index value.
Compatibility issues for Matlab 2013a onwards:
The function unique has changed in Matlab 2013a. For that version onwards, add the 'legacy' flag to unique, i.e. replace the second line by
[~, ii, jj] = unique(value_sort,'legacy')
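As a quick sanity check, here is a hypothetical toy input (the small block of data mentioned in the question) and the output the code above produces for it (using pre-2013a / 'legacy' unique semantics):
StarData = [1 2 3; 1 5 9; 2 5 7; 2 4 6];   % column 1 is the ID
% Running the code above gives:
% averages =
%     1.0000    3.5000    6.0000
%     2.0000    4.5000    6.5000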

How can I generate a binary matrix with specific patterns?

I have a binary matrix of size m-by-n. Given below is a sample binary matrix (the real matrix is much larger):
1010001
1011011
1111000
0100100
Given p = m*n, I have 2^p possible matrix configurations. I would like to get some patterns which satisfy certain rules. For example:
I want at least k cells in the jth column to be zero
I want the sum of the cell values in the ith row to be greater than a given number Ai
I want at least g consecutive cells in a column to be one
etc....
How can I get such patterns satisfying these constraints strictly without sequentially checking all the 2^p combinations?
In my case, p can be a number like 2400, giving approximately 2.96476e+722 possible combinations.
Instead of iterating over all 2^p combinations, one way you could generate such binary matrices is by performing repeated row- and column-wise operations based on the given constraints you have. As an example, I'll post some code that will generate a matrix based on the three constraints you have listed above:
A minimum number of zeroes per column
A minimum sum for each row
A minimum sequential length of ones per column
Initializations:
First start by initializing a few parameters:
nRows = 10; % Row size of matrix
nColumns = 10; % Column size of matrix
minZeroes = 5; % Constraint 1 (for columns)
minRowSum = 5; % Constraint 2 (for rows)
minLengthOnes = 3; % Constraint 3 (for columns)
Helper functions:
Next, create a couple of functions for generating column vectors that match constraints 1 and 3 from above:
function vector = make_column
    vector = [false(minZeroes,1); true(nRows-minZeroes,1)]; % Create vector
    [vector,maxLength] = randomize_column(vector);          % Randomize order
    while maxLength < minLengthOnes,       % Loop while constraint 3 is not met
        [vector,maxLength] = randomize_column(vector);      % Randomize order
    end
end

function [vector,maxLength] = randomize_column(vector)
    vector = vector(randperm(nRows));             % Randomize order
    edges = diff([false; vector; false]);         % Find rising and falling edges
    maxLength = max(find(edges == -1)-find(edges == 1)); % Find longest
                                                          % sequence of ones
end
The function make_column will first create a logical column vector with the minimum number of 0 elements and the remaining elements set to 1 (using the functions TRUE and FALSE). This vector will undergo random reordering of its elements until it contains a sequence of ones greater than or equal to the desired minimum length of ones. This is done using the randomize_column function. The vector is randomly reordered using the RANDPERM function to generate a random index order. The edges where the sequence switches between 0 and 1 are detected using the DIFF function. The indices of the edges are then used to find the length of the longest sequence of ones (using FIND and MAX).
Generate matrix columns:
With the above two functions we can now generate an initial binary matrix that will at least satisfy constraints 1 and 3:
binMat = false(nRows,nColumns);      % Initialize matrix
for iColumn = 1:nColumns,
    binMat(:,iColumn) = make_column; % Create each column
end
Satisfy the row sum constraint:
Of course, now we have to ensure that constraint 2 is satisfied. We can sum across each row using the SUM function:
rowSum = sum(binMat,2);
If any elements of rowSum are less than the minimum row sum we want, we will have to adjust some column values to compensate. There are a number of different ways you could go about modifying column values. I'll give one example here:
while any(rowSum < minRowSum),                  % Loop while constraint 2 is not met
    [minValue,rowIndex] = min(rowSum);          % Find row with lowest sum
    zeroIndex = find(~binMat(rowIndex,:));      % Find zeroes in that row
    randIndex = round(1+rand.*(numel(zeroIndex)-1));
    columnIndex = zeroIndex(randIndex);         % Choose a zero at random
    column = binMat(:,columnIndex);
    while ~column(rowIndex),                    % Loop until zero changes to one
        column = make_column;                   % Make new column vector
    end
    binMat(:,columnIndex) = column;             % Update binary matrix
    rowSum = sum(binMat,2);                     % Update row sum vector
end
This code will loop until all the row sums are greater than or equal to the minimum sum we want. First, the index of the row with the smallest sum (rowIndex) is found using MIN. Next, the indices of the zeroes in that row are found and one of them is randomly chosen as the index of a column to modify (columnIndex). Using make_column, a new column vector is continuously generated until the 0 in the given row becomes a 1. That column in the binary matrix is then updated and the new row sum is computed.
Summary:
For a relatively small 10-by-10 binary matrix, and the given constraints, the above code usually completes in no more than a few seconds. With more constraints, things will of course get more complicated. Depending on how you choose your constraints, there may be no possible solution (for example, setting minRowSum to 6 will cause the above code to never converge to a solution).
Hopefully this will give you a starting point to begin generating the sorts of matrices you want using vectorized operations.
If you have enough constraints, exploring all possible matrices could be attempted:
// Explore all possibilities starting at POSITION (0..P-1)
explore(int position)
{
    // Check if one or more constraints can't be verified anymore with
    // all values currently set.
    invalid = ...;
    if (invalid) return;

    // Do we have a solution?
    if (position >= p)
    {
        // print the matrix
        return;
    }

    // Set one more value and continue exploring
    for (int value = 0; value < 2; value++)
    {
        matrix[position] = value;
        explore(position + 1);
    }
}
If the number of constraints is low, this approach will take too much time.
In this case, for the kind of constraints you gave as examples, simulated annealing may be a good solution.
You must design an energy function that is high when all constraints are met. The procedure would be something like this:
1. Generate a random matrix
2. Compute the energy E0
3. Change one cell
4. Compute the energy E1
5. If E1 > E0, or if E0 - E1 is smaller than f(temperature), keep the change; otherwise reverse the move
6. Update the temperature and go to step 2, unless the stop criterion is reached
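A minimal MATLAB sketch of that loop, assuming a hypothetical scoring function countSatisfied (which you would write for your own constraint set, returning a value that is highest when all constraints are met) and a simple choice of f(temperature) = T*rand:
% Simulated-annealing sketch (illustrative only; countSatisfied is a placeholder).
nRows = 10; nColumns = 10;
binMat = rand(nRows, nColumns) > 0.5;    % 1. random starting matrix
T = 1;                                   % initial temperature
coolingRate = 0.999;
for iter = 1:1e5
    E0 = countSatisfied(binMat);         % 2. energy of the current matrix
    r = randi(nRows); c = randi(nColumns);
    binMat(r, c) = ~binMat(r, c);        % 3. flip one cell
    E1 = countSatisfied(binMat);         % 4. energy of the modified matrix
    if ~(E1 > E0 || (E0 - E1) < T*rand)  % 5. keep the change, or...
        binMat(r, c) = ~binMat(r, c);    %    ...reverse the move
    end
    T = T * coolingRate;                 % 6. cool down (stop after a fixed budget)
end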
If all the constraints relate to columns (as is the case in the question), then you can find all possible valid columns and check that each column in the matrix is in this set. (In other words, by considering each column independently you reduce the number of possibilities a lot.)
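For the two column-wise example constraints (1 and 3 above), a small sketch of this idea, reusing nRows, minZeroes and minLengthOnes from the earlier answer (feasible here because 2^nRows is only 1024 for a 10-row matrix):
allCols = (dec2bin(0:2^nRows-1, nRows) - '0').'; % every possible column, one per column of allCols
valid = false(1, size(allCols, 2));
for i = 1:size(allCols, 2)
    col = allCols(:, i);
    edges = diff([0; col; 0]);                               % rising and falling edges
    maxRun = max([0; find(edges == -1) - find(edges == 1)]); % longest run of ones
    valid(i) = (sum(col == 0) >= minZeroes) && (maxRun >= minLengthOnes);
end
validCols = allCols(:, valid);   % the pool of admissible columns
% Build a candidate matrix by sampling from the pool; the row-sum constraint (2)
% would still need to be checked or repaired, e.g. with the while loop above.
binMat = validCols(:, randi(size(validCols, 2), 1, nColumns));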
I might be way off here, but I remember doing something similar once with some genetic algorithm.
Check out pseudo boolean constraints (also called 0-1 integer programming).
This is virtually impossible if your constraint set is complex enough. You might try to use a stochastic optimizer, like simulated annealing, particle swarm optimization, or a genetic algorithm to find a feasible solution.
However, if you can generate one (non-random) solution to such a problem, then often you can generate others by random permutations made to the existing solution.
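For example, with the three column/row constraints listed in the question, reordering the columns of a known-valid matrix preserves every column property and every row sum, so new solutions can be generated from an existing binMat (a sketch under that assumption; reordering rows would not be safe, since it can break the run-of-ones constraint):
newMat = binMat(:, randperm(size(binMat, 2))); % shuffle the columns of a valid solution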