Matlab - storing data from a loop in a Matrix (not a vector)

As part of a bigger script I want to store data from a while loop in a matrix. I want to save parts of the COG_Ton_Av matrix, which is 1738x3, in a new matrix. COG_Ton_Av changes within every loop iteration, so I want to store the results outside the loop. I have found multiple entries on how to store the data in a vector, but nothing for a matrix. What I tried is:
valuesforts = zeros(1000,3);
yr = 1
while Qn > 0
    yindex = Gmhk*100
    zindex = round(gs*100)
    ts = COG_Ton_Av(zindex:yindex, :)
    valuesforts(yr)=ts
    yr = yr+1
end
I only posted part of the while loop to keep the question simple; I hope it is sufficient to answer it.
While trying this I get the following error:
Subscripted assignment dimension mismatch.
Error in cutoff_work14_priceescalation_and_stockpiling (line 286)
valuesforts(yr)=ts

The error means that ts has a different size from the target of the assignment: valuesforts is indexed with yr as if it were a vector, so valuesforts(yr) refers to a single element.
If the dimensions of ts vary on each iteration of the loop, then use a cell array:
valuesforts = cell(<number of years>, 1);
...
valuesforts{yr} = ts;
then the dimensions of ts won't matter.
To extract the data, also use curly braces { }, e.g.
meanValues(yr) = mean(valuesforts{yr});
Bear in mind that the matrix within each cell of valuesforts will have the same dimensions as ts had when it was assigned.
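For the loop in the question, a minimal sketch of the cell-array approach might look like this (assuming Qn, Gmhk, gs and COG_Ton_Av come from the rest of your script):
valuesforts = cell(1000, 1);            % over-allocate, trim afterwards
yr = 1;
while Qn > 0
    yindex = Gmhk*100;
    zindex = round(gs*100);
    ts = COG_Ton_Av(zindex:yindex, :);
    valuesforts{yr} = ts;               % each cell holds one iteration's matrix
    yr = yr + 1;
    % ... rest of the loop body from your script, which updates Qn ...
end
valuesforts = valuesforts(1:yr-1);      % drop unused cells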
Alternatively, if ts is always the same size, then pre-allocate valuesforts as:
valuesforts = zeros(<number of years>,<expected length of ts>,3);
...
valuesforts(yr,:,:) = ts;
Then it depends on what you want to do with valuesforts: reshape it or plot it.
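For example (a sketch, assuming valuesforts was filled as above), you can recover the 2-D matrix stored for a given year with squeeze:
ts_year5 = squeeze(valuesforts(5,:,:));   % rows-by-3 matrix stored for year 5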
In the worst case (not recommended), you can let valuesforts grow with every loop iteration.
Initialise it as empty:
valuesforts=[];
Then vertically append ts to valuesforts:
valuesforts = [valuesforts; ts];
This gives you a matrix with 3 columns and a total number of rows equal to the sum, over all loop iterations (years), of the number of rows of ts in that iteration.
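Applied to the loop from the question, this append-and-grow variant might look like the following sketch (again assuming Qn, Gmhk, gs and COG_Ton_Av are defined elsewhere in your script):
valuesforts = [];
while Qn > 0
    yindex = Gmhk*100;
    zindex = round(gs*100);
    ts = COG_Ton_Av(zindex:yindex, :);
    valuesforts = [valuesforts; ts];   % grows on every iteration
    % ... rest of the loop body from your script, which updates Qn ...
end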

How can I avoid constructing these grid variables in MATLAB?

I have the following calculations in two steps:
Initially, I create a set of 4 grid vectors, each spanning from -2 to 2:
u11grid=[-2:0.1:2];
u12grid=[-2:0.1:2];
u22grid=[-2:0.1:2];
u21grid=[-2:0.1:2];
[ca, cb, cc, cd] = ndgrid(u11grid, u12grid, u22grid, u21grid);
u11grid=ca(:);
u12grid=cb(:);
u22grid=cc(:);
u21grid=cd(:);
%grid=[u11grid u12grid u22grid u21grid]
sg=size(u11grid,1);
Next, I have an algorithm assigning the same index (equalorder) to the rows of grid sharing a specific structure:
U1grid=[-u11grid -u21grid -u12grid -u22grid Inf*ones(sg,1) -Inf*ones(sg,1)];
U2grid=[u21grid-u11grid -u21grid u22grid-u12grid -u22grid Inf*ones(sg,1) -Inf*ones(sg,1)];
s1=size(U1grid,2);
s2=size(U2grid,2);
%-------------------------------------------------------
%sortedU1grid gives U1grid with each row sorted from smallest to largest
%for each row i of sortedU1grid and for j=1,2,...,s1 index1(i,j) gives
%the column position 1,2,...,s1 in U1grid(i,:) of sortedU1grid(i,j)
[sortedU1grid,index1] = sort(U1grid,2);
%for each row i of sortedU1grid, d1(i,:) is a 1x(s1-1) row of ones and zeros
% d1(i,j)=1 if sortedU1grid(i,j)-sortedU1grid(i,j-1)=0 and d1(i,j)=0 otherwise
d1 = diff(sortedU1grid,[],2) == 0;
%-------------------------------------------------------
%Repeat for U2grid
[sortedU2grid,index2] = sort(U2grid,2);
d2 = diff(sortedU2grid,[],2) == 0;
%-------------------------------------------------------
%Assign the same index to the rows of grid sharing the same "ordering"
[~,~,equalorder] = unique([index1 index2 d1 d2],'rows', 'stable'); %sgx1
My question: is there a way to compute the algorithm in step 2 without the initial construction of the grid vectors in step 1? I am asking this because step 1 takes a lot of memory given that it basically generates the Cartesian product of 4 sets.
A solution should not rely on the specific content of U1grid and U2grid as that part changes in my actual code. To be more clear: U1grid and U2grid are ALWAYS derived from u11grid, ..., u21grid; however, the way in which they are derived from u11grid, ..., u21grid is slightly more complicated in my actual code from what I have reported here.
As Cris Luengo mentions in a comment, you're always going to be dealing with a trade-off between speed and memory. That said, one option you have is to only compute each of your 4 grid variables (u11grid u12grid u22grid u21grid) when needed instead of computing them once and storing them. You will save on memory but will lose speed if you are recomputing each one multiple times.
The solution I came up with involves creating an anonymous function equivalent for each of the 4 grid variables, using combinations of repmat and repelem to compute each individually instead of ndgrid to compute them all together:
u11gridFcn = @() repmat((-2:0.1:2).', 41.^3, 1);
u12gridFcn = @() repmat(repelem((-2:0.1:2).', 41), 41.^2, 1);
u22gridFcn = @() repmat(repelem((-2:0.1:2).', 41.^2), 41, 1);
u21gridFcn = @() repelem((-2:0.1:2).', 41.^3);
sg = 41.^4;
You would then use these by replacing every usage of your 4 grid variables in U1grid and U2grid with their corresponding function call. For your specific example above, this would be the new code for U1grid and U2grid (note also the use of inf(...) instead of Inf*ones(...), a small detail):
U1grid = [-u11gridFcn() ...
-u21gridFcn() ...
-u12gridFcn() ...
-u22gridFcn() ...
inf(sg, 1) ...
-inf(sg, 1)];
U2grid = [u21gridFcn()-u11gridFcn() ...
-u21gridFcn() ...
u22gridFcn()-u12gridFcn() ...
-u22gridFcn() ...
inf(sg, 1) ...
-inf(sg, 1)];
In this example, you avoid the memory needed to store the 4 grid variables, but the values for u11grid and u12grid will each be computed twice while the values for u21grid and u22grid will each be computed three times. Likely a small time trade-off for a potentially significant memory savings.
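As a quick sanity check (a sketch, using the -2:0.1:2 grids from the question), you can confirm that the anonymous functions reproduce exactly what ndgrid would have produced:
[ca, cb, cc, cd] = ndgrid(-2:0.1:2, -2:0.1:2, -2:0.1:2, -2:0.1:2);
isequal(ca(:), u11gridFcn())   % each comparison should return logical 1
isequal(cb(:), u12gridFcn())
isequal(cc(:), u22gridFcn())
isequal(cd(:), u21gridFcn())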
You may be able to remove the ndgrid, but it is not the memory bottleneck of this code, which is the call to unique on the large matrix A = [index1 index2 d1 d2]. The size of A is 2825761 by 22 (much larger than the grids), and it seems that unique may even internally copy A. I was able to avoid this call using
[sorted, ind] = sortrows([index1 index2 d1 d2]);
change = [1; any(diff(sorted), 2)];
uniqueInd = cumsum(change);
equalorder(ind) = uniqueInd;
[~, ~, equalorder] = unique(equalorder, 'stable');
where the last line is still the memory bottleneck and is only needed if you want the same numbering as your code produces. If any unique ordering is okay, you can skip it. You may be able to further reduce the memory footprint by carefully clearing variables as soon as they are no longer needed.
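To see what the sortrows/cumsum trick does, here is a tiny illustration on a made-up toy matrix (a sketch, not the real index data):
A = [3 1; 1 2; 3 1; 2 2];            % rows 1 and 3 are identical
[sortedA, ind] = sortrows(A);
change = [1; any(diff(sortedA), 2)];
groupId = zeros(size(A,1), 1);
groupId(ind) = cumsum(change);       % identical rows get the same id (numbered in sorted order)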

Fast way to compute row by row matrix correlation

I have two very large matrices (228453x460) and I want to compute correlation between rows.
for i = 1:228453
    if (vec1_preprocess(i,1))
        for j = 1:228453
            df = effdf(vec1_preprocess(i,:)', vec2_preprocess(j,:)');
            corr_temp = corr(vec1_preprocess(i,:)', vec2_preprocess(j,:)');
            p = calculate_p(corr_temp, df);
            temp = (meanVec(i) + p)/2;
            meanVec(i) = temp;
        end
        disp(i);
    end
end
This takes ~1 day. Is there a more direct way to compute this?
Edit: Code for effdf
function df = effdf(ts1,ts2);
%function df = effdf(ts1,ts2);
ts1=ts1-mean(ts1);
ts2=ts2-mean(ts2);
N=length(ts1);
ac1=xcorr(ts1);
ac1=ac1/max(ac1); % normalized autocorrelation
ac1=ac1(((length(ac1)+3)/2):((length(ac1)+3)/2+floor(N/4)));
ac2=xcorr(ts2);
ac2=ac2/max(ac2); % normalized autocorrelation
ac2=ac2(((length(ac2)+3)/2):((length(ac2)+3)/2+floor(N/4)));
df = 1/((1/N)+(2/N)*sum(((N-(1:length(ac1)))/N)'.*ac1.*ac2));
Assuming that your custom functions calculate_p and effdf are already well optimized and are not the bottleneck of your script, let's focus on what we have.
The first problem I see is:
if (vec1_preprocess(i,1))
A check performed over 228453 iterations can noticeably increase the running time. Hence, extract only the matrix rows that don't have a 0 in the first column and perform your calculations on those:
idx = vec1_preprocess(:,1) ~= 0;
vec1_preprocess = vec1_preprocess(idx,:);
for i = 1:size(vec1_preprocess,1)
% ...
end
The second problem is corr. It seems like you are computing p-values as well, using calculate_p. Why don't you use the built-in p-values returned by the function as the second output argument?
[c,p] = corr(A,B);
Alternatively, if Pearson's correlation is what you are looking for, you could replace corr with corrcoef to see if it gives better performance.
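For instance, for two single columns x and y (a sketch with placeholder variable names), corrcoef returns both the correlation matrix and the matching p-values:
[R, P] = corrcoef(x, y);   % R(1,2) is the correlation coefficient, P(1,2) its p-value
r = R(1,2);
p = P(1,2);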
Last but not least (in fact it's the most important thing): is there any reason why you are performing this computation row by row instead of running it on the whole matrices?
If you read the documentation, you'll see that corr computes the correlation between columns, not rows.
To convert rows into columns and columns into rows, simply transpose the matrix:
tmp1 = vec1_preprocess';
tmp2 = vec2_preprocess';
C = corr(tmp1,tmp2);

dynamically fill vector without assigning empty matrix

Oftentimes I need to dynamically fill a vector in Matlab. However, this is slightly annoying since you have to define an empty variable first, e.g.:
[a, b, c] = deal([]);
for ind = 1:10
    if rand > .5   % some random condition to emphasize the dynamic filling of the vector
        a = [a, randi(5)];
    end
end
a   % display result
Is there a better way to implement this 'push' function, so that you do not have to define an empty vector beforehand? People tell me this is nonsensical in Matlab; if you think this is the case, please explain why.
related: Push a variable in a vector in Matlab, is-there-an-elegant-way-to-create-dynamic-array-in-matlab
In MATLAB, pre-allocation is the way to go. From the docs:
for and while loops that incrementally increase the size of a data structure each time through the loop can adversely affect performance and memory use.
As pointed out in the comments by m7913d, there is a question on MathWorks' answers section which addresses this same point, read it here.
I would suggest "over-allocating" memory, then reducing the size of the array after your loop.
numloops = 10;
a = nan(numloops, 1);
for ind = 1:numloops
    if rand > 0.5
        a(ind) = 1;   % assign some value to the current loop index
    end
end
a = a(~isnan(a));   % get rid of values which weren't used (and remain NaN)
No, this doesn't decrease the amount you have to write before your loop; it's even worse than having to write a = []! However, you're better off spending a few extra keystrokes and minutes writing well-structured code than making that saving and having worse code.
It is (as far as I know) not possible in MATLAB to omit the initialisation of your variable before using it on the right-hand side of an expression. Moreover, it is not desirable to omit it, as preallocating an array is almost always the right way to go.
As mentioned in this post, it is desirable to preallocate a matrix even if the exact number of elements is not known in advance. To demonstrate this, here is a small benchmark:
Ns = [1 10 100 1000 10000 100000];
timeEmpty = zeros(size(Ns));
timePreallocate = zeros(size(Ns));
for i = 1:length(Ns)
    N = Ns(i);
    timeEmpty(i) = timeit(@() testEmpty(N));
    timePreallocate(i) = timeit(@() testPreallocate(N));
end
figure
semilogx(Ns, timeEmpty ./ timePreallocate);
xlabel('N')
ylabel('time_{empty}/time_{preallocate}');
% do not preallocate memory
function a = testEmpty(N)
    a = [];
    for ind = 1:N
        if rand > .5   % some random condition to emphasize the dynamic filling of the vector
            a = [a, randi(5)];
        end
    end
end

% preallocate memory with the largest possible return size
function a = testPreallocate(N)
    last = 0;
    a = zeros(N, 1);
    for ind = 1:N
        if rand > .5   % some random condition to emphasize the dynamic filling of the vector
            last = last + 1;
            a(last) = randi(5);
        end
    end
    a = a(1:last);
end
The resulting figure (generated by the code above) shows how much slower the method without preallocation is compared to preallocating a matrix of the largest possible return size. Note that preallocation is especially important for large matrices, because the cost of repeatedly growing an array increases much faster than linearly with its size.

Index exceeds matrix dimensions

X= [P(1,:,:);
P(2,:,:);
P(3,:,:)];
y= P(4:end,:);
indTrain = randperm(4798);
indTrain = indTrain(1:3838);
trainX= X(indTrain,:);
trainy = y(indTrain);
indTest = 3839:4798;
indTest(indTrain) = [];
testX = X(indTest,:);
testy = y(indTest);
It throws an error at trainX = X(indTrain,:); saying
Index exceeds matrix dimensions
Can anyone please clarify? Thanks.
By the way, I have a 4x4798 data matrix in which the first 3 rows serve as predictors and the last row (4th row) is my response. How would I correctly split the data into the first 3838 columns as my training set and the remaining columns as my testing set?
Fix your error
To fix the indexing error you need to select the column indices of X, rather than the row indices:
trainX = X(:, indTrain );
Some words of advice
It seems like your P matrix is 4-by-4798, and it is two-dimensional. Therefore, writing P(1,:,:) does select the first row, but it gives the impression that P is three-dimensional because of the extra : at the end. Don't do that. It's a bad habit and makes your code harder to read/understand/debug.
X = P(1:3,:); % select all three rows at once
y = P(4,:); % no need for 4:end here - again, gives wrong impression as if you expect more than a single label per x.
Moreover, I do not understand what you are trying to accomplish with indTest(indTrain) = []. Are you trying to ensure that the train and test sets are mutually exclusive?
This line will most likely cause an error: your test set has only 960 elements, while indTrain contains 3838 indices drawn from 1:4798 (randomly permuted), so you will get an "Index exceeds..." error again.
Note that because indTrain comes from randperm, it is not automatically disjoint from indTest = 3839:4798. If you want the two sets to be mutually exclusive, use setdiff:
indTest = setdiff( indTest, indTrain );
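Putting it together, a minimal sketch of the whole split might look like this (assuming, as in the question, that P is 4-by-4798 and you want 3838 training columns and 960 test columns):
X = P(1:3, :);                         % predictors: first three rows
y = P(4, :);                           % response: fourth row
indTrain = randperm(4798, 3838);       % 3838 random column indices
indTest  = setdiff(1:4798, indTrain);  % the remaining 960 columns
trainX = X(:, indTrain);  trainy = y(indTrain);
testX  = X(:, indTest);   testy  = y(indTest);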

find indices in array, use indices as lookup, plot w/r/t time

I'm looking to find the n largest values in an array, then to use the indices of those values as a lookup into another array representing time. But I am wondering how I can plot this if I want time to display as a continuous variable. Do I need to zero out data? That wouldn't be preferable for my use case, as I'm looking to save memory.
Let's say that I have array A, which is where I am looking for the max values. Then I have array T, which represents timestamps. I want my plot to display continuous time, and plot() doesn't like arguments of differing sizes. How do most people deal with this?
Here's what I've got so far:
numtofind = 4;
A = m{:,10};
T = ((m{:,4} * 3600.0) + (m{:,5} * 60.0) + m{:,6});
[sorted, sortindex] = sort(A(:), 'descend');
maxvalues = sorted(1:numtofind);
maxindex = sortindex(1:numtofind);
corresponding_timestamps = T(maxindex);
%here i plot the max values against time/corresponding timestamps,
%but i want to place them in the right timestamp and display time as continuous
%rather than the filtered set:
plot(time_values, maxvalues);
When you say "time as continuous", do you mean you want time going from minimum to maximum? If so, you can just sort corresponding_timestamps and use that to reorder maxvalues. Even if you don't do that, you can still do plot(time_values, maxvalues, '.') to get a scatter plot which won't mess up your graph with lines.
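A minimal sketch of that sorting step, using the variable names from the question:
[sortedTimes, order] = sort(corresponding_timestamps);
plot(sortedTimes, maxvalues(order), '.');   % markers only, in chronological order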