How to neatly sum values of histogram bins, given a value matrix and an associated category association matrix (bin association) - matlab

Consider the example following below, where I have a 10x10 matrix, say A, of random values in some range, say [-5, 5]. I quantize the values of A into 8 categories, 1, ..., 8, such that an additional 10x10 matrix, say qA, describes the category association for each number in A. Finally, I produce the sums of all values assigned to each category. My question regards this final step.
myRange = 5; % values in open interval [-myRange, myRange]
A = myRange*(2*rand(10) - 1);
qA = uencode(A, 3, myRange)+1;
% (+) create "histogram" of sum of values assigned to each bin
myHistogram = zeros(8,1);
for i = 1:numel(A)
myHistogram(qA(i)) = myHistogram(qA(i)) + A(i);
end
bar(myHistogram)
Question: Is the some neater way of doing this, specifically the counting step (+) above? (Some better alternative than explicitly iterating over each element in the matrix A?).

Just as I was about to finish up and post my question I found a satisfying answer to it, however not here on SO. As self-answering is encouraged, I'll post the Q+A instead of aborting this Q posting.
Hence, based on the following Matlab Central thread, one neater solution is as follows:
myRange = 5; % values in open interval [-myRange, myRange]
A = myRange*(2*rand(10) - 1);
qA = uencode(A, 3, myRange)+1;
% or, if you dont have Signal Processing Toolbox required for 'uencode'
% [~, ~, qA] = histcounts(A, -myRange:myRange/4:myRange);
% (+) create "histogram" of sum of values assigned to each bin
myHistogram = accumarray(qA(:), A(:), [8 1])
Possibly there's alternative/even better ways to do this, performing quantizing and bin value summation in same step?

Related

How to vectorize Matlab Code with mvnpdf in?

I have some working code in matlab, and speed is vital. I have vectorized/optimized many parts of it, and the profiler now tells me that the most time is spent a short piece of code. For this,
I have some parameter sets for a multi-variate normal
distribution.
I then have to get the value from the corresponding PDF at some point
pos,
and multiply it by some other value stored in a vector.
I have produced a minimal working example below:
num_params = 1000;
prob_dist_params = repmat({ [1, 2], [10, 1; 1, 5] }, num_params, 1);
saved_nu = rand( num_params, 1 );
saved_pos = rand( num_params, 2 );
saved_total = 0;
tic()
for param_counter = 1:size(prob_dist_params)
% Evaluate the PDF at specified points
pdf_vals = mvnpdf( saved_pos(param_counter,:), prob_dist_params{param_counter,1}, prob_dist_params{param_counter, 2} );
saved_total = saved_total + saved_nu(param_counter)*pdf_vals;
end % End of looping over parameters
toc()
I am aware that prob_dist_params are all the same in this case, but in my code we have each element of this different depending on a few things upstream. I call this particular piece of code many tens of thousands of time in my full program, so am wondering if there is anything at all I can do to vectorize this loop, or failing that, speed it up at all? I do not know how to do so with the inclusion of a mvnpdf() function.
Yes you can, however, I don't think it will give you a huge performance boost. You will have to reshape your mu's and sigma's.
Checking the doc of mvnpdf(X,mu,sigma), you see that you will have to provide X and mu as n-by-d numeric matrix and sigma as d-by-d-by-n.
In your case, d is 2 and n is 1000. You have to split the cell array in two matrices, and reshape as follows:
prob_dist_mu = cell2mat(prob_dist_params(:,1));
prob_dist_sigma = cell2mat(permute(prob_dist_params(:,2),[3 2 1]));
With permute, I make the first dimension of the cell array the third dimension, so cell2mat will result in a 2-by-2-by-1000 matrix. Alternatively you can define them as follows,
prob_dist_mu = repmat([1 2], [num_params 1]);
prob_dist_sigma = repmat([10, 1; 1, 5], [1 1 num_params]);
Now call mvnpdf with
pdf_vals = mvnpdf(saved_pos, prob_dist_mu, prob_dist_sigma);
saved_total = saved_nu.'*pdf_vals; % simple dot product

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check if indices meet a condition or not
If they do: calculate xcorr between the two rows and put it into new vector
Find the index of the maximum value of sub vector and replace element of LAG matrix with this value
I dont know how I can speed this code up by vectorizing or otherwise.
b=size(data,1);
F=size(data,2);
LAG= zeros(b,b);
for i=1:b
for j=1:b
if j>i
x=data(i,:);
y=data(j,:);
d=xcorr(x,y);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(i,j)=I-1;
d=xcorr(y,x);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(j,i)=I-1;
end
end
end
First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, then you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and appear in indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located in index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located in index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it will produce large intermediate and output variables that can exceed the maximum array size available. In such a case for loops may be unavoidable, but you can improve upon the for-loop solution above. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
% Loop over upper matrix triangle:
for col = (row+1):b
% Cross-correlation for upper triangle:
d = round(xcorr(data(row, :), data(col, :)));
[~, I] = max(d(:, F:(2*F)-1));
LAG(row, col) = I-1;
% Cross-correlation for lower triangle:
d = fliplr(d);
[~, I] = max(d(:, F:(2*F)-1));
LAG(col, row) = I-1;
end
end

Assign labels based on given examples for a large dataset effectively

I have matrix X (100000 X 10) and vector Y (100000 X 1). X rows are categorical and assume values 1 to 5, and labels are categorical too (11 to 20);
The rows of X are repetitive and there are only ~25% of unique rows, I want Y to have statistical mode of all the labels for a particular unique row.
And then there comes another dataset P (90000 X 10), I want to predict labels Q based on the previous exercise.
What I tried is finding unique rows of X using unique in MATLAB, and then assign statistical mode of each of these labels for the unique rows. For P, I can use ismember and carry out the same.
The issue is in the size of the dataset and it takes an 1.5-2 hours to complete the process. Is there a vectorize version possible in MATLAB?
Here is my code:
[X_unique,~,ic] = unique(X,'rows','stable');
labels=zeros(length(X_unique),1);
for i=1:length(X_unique)
labels(i)=mode(Y(ic==i));
end
Q=zeros(length(P),1);
for j=1:length(X_unique)
Q(all(repmat(X_unique(j,:),length(P),1)==P,2))=label(j);
end
You will be able to accelerate your first loop a great deal if you replace it entirely with:
labels = accumarray(ic, Y, [], #(y) mode(y));
The second loop can be accelerated by using all(bsxfun(#eq, X_unique(i,:), P), 2) inside Q(...). This is a good vectorized approach assuming your arrays are not extremely large w.r.t. the available memory on your machine. In addition, to save more time, you could use the unique trick you did with X on P, run all the comparisons on a much smaller array:
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
EDIT:
to compute Q_unique in the following way: and then convert it back to the full array using:
Q_unique = zeros(length(P_unique),1);
for i = 1:length(X_unique)
Q_unique(all(bsxfun(#eq, X_unique(i,:), P_unique), 2)) = labels(i)
end
and convert back to Q_full to match the original P input:
Q_full = Q_unique(IC_P);
END EDIT
Finally, if memory is an issue, in addition to everything above, you might want you use a semi-vectorized approach inside your second loop:
for i = 1:length(X_unique)
idx = true(length(P), 1);
for j = 1:size(X_unique,2)
idx = idx & (X_unique(i,j) == P(:,j));
end
Q(idx) = labels(i);
% Q(all(bsxfun(#eq, X_unique(i,:), P), 2)) = labels(i);
end
This would take about x3 longer compared with bsxfun but if memory is limited then you gotta pay with speed.
ANOTHER EDIT
Depending on your version of Matlab, you could also use containers.Map to your advantage by mapping textual representations of the numeric sequences to the calculated labels. See example below.
% find unique members of X to work with a smaller array
[X_unique, ~, IC_X] = unique(X, 'rows', 'stable');
% compute labels
labels = accumarray(IC_X, Y, [], #(y) mode(y));
% convert X to cellstr -- textual representation of the number sequence
X_cellstr = cellstr(char(X_unique+48)); % 48 is ASCII for 0
% map each X to its label
X_map = containers.Map(X_cellstr, labels);
% find unique members of P to work with a smaller array
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
% convert P to cellstr -- textual representation of the number sequence
P_cellstr = cellstr(char(P_unique+48)); % 48 is ASCII for 0
% --- EDIT --- avoiding error on missing keys in X_map --------------------
% find which P's exist in map
isInMapP = X_map.isKey(P_cellstr);
% pre-allocate Q_unique to the size of P_unique (can be any value you want)
Q_unique = nan(size(P_cellstr)); % NaN is safe to use since not a label
% find the labels for each P_unique that exists in X_map
Q_unique(isInMapP) = cell2mat(X_map.values(P_cellstr(isInMapP)));
% --- END EDIT ------------------------------------------------------------
% convert back to full Q array to match original P
Q_full = Q_unique(IC_P);
This takes about 15 seconds to run on my laptop. Most of which is consumed by computation of mode.

How to see resampled data after BOOTSTRAP

I was trying to resample (with replacement) my database using 'bootstrap' in Matlab as follows:
D = load('Data.txt');
lead = D(:,1);
depth = D(:,2);
X = D(:,3);
Y = D(:,4);
%Bootstraping to resample 100 times
[resampling100,bootsam] = bootstrp(100,'corr',lead,depth);
%plottig the bootstraping result as histogram
hist(resampling100,10);
... ... ...
... ... ...
Though the script written above is correct, I wonder how I would be able to see/load the resampled 100 datasets created through bootstrap? 'bootsam(:)' display the indices of the data/values selected for the bootstrap samples, but not the new sample values!! Isn't it funny that I'm creating fake data from my original data and I can't even see what is created behind the scene?!?
My second question: is it possible to resample the whole matrix (in this case, D) altogether without using any function? However, I know how to create random values from a vector data using 'unidrnd'.
Thanks in advance for your help.
The answer to question 1 is that bootsam provides the indices of the resampled data. Specifically, the nth column of bootsam provides the indices of the nth resampled dataset. In your case, to obtain the nth resampled dataset you would use:
lead_resample_n = lead(bootsam(:, n));
depth_resample_n = depth(bootsam(:, n));
Regarding the second question, I'm guessing what you mean is, how would you just get a re-sampled dataset without worrying about applying a function to the resampled data. Personally, I would use randi, but in this situation, it is irrelevant whether you use randi or unidrnd. An example follows that assumes 4 columns of some data matrix D (as in your question):
%# Build an example dataset
T = 10;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, 1);
%# Obtain the resampled data
DResampled = D(Ind, :);
To create multiple re-sampled data, you can simply loop over the creation of random indices. Or you could do it in one step by creating a matrix of random indices and using that to index D. With careful use of reshape and permute you can turn this into a T*4*M array, where indexing m = 1, ..., M along the third dimension yields the mth resampled dataset. Example code follows:
%# Build an example dataset
T = 10;
M = 3;
D = randn(T, 4);
%# Obtain a set of random indices, ie indices of draws with replacement
Ind = randi(T, T, M);
%# Obtain the resampled data
DResampled = permute(reshape(D(Ind, :)', 4, T, []), [2 1 3]);

Update only one matrix element for iterative computation

I have a 3x3 matrix, A. I also compute a value, g, as the maximum eigen value of A. I am trying to change the element A(3,3) = 0 for all values from zero to one in 0.10 increments and then update g for each of the values. I'd like all of the other matrix elements to remain the same.
I thought a for loop would be the way to do this, but I do not know how to update only one element in a matrix without storing this update as one increasingly larger matrix. If I call the element at A(3,3) = p (thereby creating a new matrix Atry) I am able (below) to get all of the values from 0 to 1 that I desired. I do not know how to update Atry to get all of the values of g that I desire. The state of the code now will give me the same value of g for all iterations, as expected, as I do not know how to to update Atry with the different values of p to then compute the values for g.
Any suggestions on how to do this or suggestions for jargon or phrases for me to web search would be appreciated.
A = [1 1 1; 2 2 2; 3 3 0];
g = max(eig(A));
% This below is what I attempted to achieve my solution
clear all
p(1) = 0;
Atry = [1 1 1; 2 2 2; 3 3 p];
g(1) = max(eig(Atry));
for i=1:100;
p(i+1) = p(i)+ 0.01;
% this makes a one giant matrix, not many
%Atry(:,i+1) = Atry(:,i);
g(i+1) = max(eig(Atry));
end
This will also accomplish what you want to do:
A = #(x) [1 1 1; 2 2 2; 3 3 x];
p = 0:0.01:1;
g = arrayfun(#(x) eigs(A(x),1), p);
Breakdown:
Define A as an anonymous function. This means that the command A(x) will return your matrix A with the (3,3) element equal to x.
Define all steps you want to take in vector p
Then "loop" through all elements in p by using arrayfun instead of an actual loop.
The function looped over by arrayfun is not max(eig(A)) but eigs(A,1), i.e., the 1 largest eigenvalue. The result will be the same, but the algorithm used by eigs is more suited for your type of problem -- instead of computing all eigenvalues and then only using the maximum one, you only compute the maximum one. Needless to say, this is much faster.
First, you say 0.1 increments in the text of your question, but your code suggests you are actually interested in 0.01 increments? I'm going to operate under the assumption you mean 0.01 increments.
Now, with that out of the way, let me state what I believe you are after given my interpretation of your question. You want to iterate over the matrix A, where for each iteration you increase A(3, 3) by 0.01. Given that you want all values from 0 to 1, this implies 101 iterations. For each iteration, you want to calculate the maximum eigenvalue of A, and store all these eigenvalues in some vector (which I will call gVec). If this is correct, then I believe you just want the following:
% Specify the "Current" A
CurA = [1 1 1; 2 2 2; 3 3 0];
% Pre-allocate the values we want to iterate over for element (3, 3)
A33Vec = (0:0.01:1)';
% Pre-allocate a vector to store the maximum eigenvalues
gVec = NaN * ones(length(A33Vec), 1);
% Loop over A33Vec
for i = 1:1:length(A33Vec)
% Obtain the version of A that we want for the current i
CurA(3, 3) = A33Vec(i);
% Obtain the maximum eigen value of the current A, and store in gVec
gVec(i, 1) = max(eig(CurA));
end
EDIT: Probably best to paste this code into your matlab editor. The stack-overflow automatic text highlighting hasn't done it any favors :-)
EDIT: Go with Rody's solution (+1) - it is much better!