Entropy of pure split calculated to NaN - MATLAB

I have written a function to calculate entropy of a vector where each element represents number of elements of a class.
function x = Entropy(a)
    % a: vector of class counts; x: entropy in bits
    t = sum(a);
    t = repmat(t, [1, size(a, 2)]);
    x = sum(-a./t .* log2(a./t));
end
e.g., a = [4 0], then entropy = -(0/4)*log2(0/4) - (4/4)*log2(4/4)
But with the above function, the entropy is NaN when the split is pure, because of log2(0), as in the example above. The entropy of a pure split should be zero.
How should I solve this problem with the least impact on performance, as the data is very large? Thanks

I would suggest you create your own log2 function:
function res = mylog2(a)
    % log2 that maps the -Inf produced by log2(0) to 0
    res = log2(a);
    res(isinf(res)) = 0;
end
This function, while breaking the usual log2 behaviour, can be used in your specific example because you multiply the result by the argument of the log, which makes the offending term zero. It is not "mathematically correct", but I believe it is what you are looking for.
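Another sketch, if you would rather not redefine log2: drop the zero-count classes before taking the log, using the convention that 0*log2(0) = 0 (the limit of p*log2(p) as p approaches 0). The function name EntropySafe is made up for illustration.

```matlab
function x = EntropySafe(a)
% Entropy of a vector of class counts; zero-count classes are
% skipped, since their contribution p*log2(p) tends to 0.
p = a ./ sum(a);      % class probabilities
p = p(p > 0);         % drop zero-probability classes
x = -sum(p .* log2(p));
end
```

For a = [4 0] this returns 0, and the logical indexing keeps the operation vectorized, so the cost should stay close to that of the original function.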

Related

Implementing a function using for-loops and matrix multiplication in MATLAB

My goal is to implement a function which performs Fourier synthesis in MATLAB, as part of learning the language. The function implements the following expression:
y = sum(ak*exp((i*2*pi*k*t)/T))
where k is the summation index, ak is a vector of Fourier coefficients, t is a vector of sampled times, and T is the period of the signal.
I have tried something like this:
for counter = -N:1:N
    k = counter + N + 1;
    y(k) = ak(k)*exp((i*2*pi*k*t)/T);
    % y is a vector of length 2N+1
end
However, this gives me an error that the two sides do not have an equal number of elements. This makes sense to me, since t is a vector of arbitrary length, and so I am trying to assign numerous values to y(k) rather than one. Instead, I suspect I need to try something like:
for counter = -N:1:N
    k = counter + N + 1;
    for t = 0:1/fs:1
        % sum over t elements for exponential operation
    end
    % sum over k elements to generate y(k)
end
However, I'm supposedly able to implement this using purely matrix multiplication. How could I do this? I've tried to wrap my head around what MATLAB is doing, but honestly, it's so far from the other languages I know that I don't really have any sense of what it's doing under the hood. Understanding how to move between operations on matrices and operations in for loops would be profoundly helpful.
You can use kron to reach your goal without for loops, i.e., matrix representation:
y = a*exp(1j*2*pi*kron(k.',t)/T);
where a, k, and t are all assumed to be row vectors.
Example
N = 3;
k = -N:N;
t = 1:0.5:5;
T = 15;
a = 1:2*N+1;
y = a*exp(1j*2*pi*kron(k.',t)/T);
such that
y =
Columns 1 through 6:
19.1335 + 9.4924i 10.4721 + 10.6861i 2.0447 + 8.9911i -4.0000 + 5.1962i -6.4721 + 0.7265i -5.4611 - 2.8856i
Columns 7 through 9:
-2.1893 - 4.5489i 1.5279 - 3.9757i 4.0000 - 1.7321i
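As a side note (not from the original answer): for a column vector and a row vector, kron reduces to an ordinary outer product, so the same result can be written with plain matrix multiplication, which is arguably closer to the "purely matrix multiplication" formulation the question asks about.

```matlab
% k.' * t is the (2N+1)-by-numel(t) matrix of products k(i)*t(j),
% identical to kron(k.', t) for these shapes.
N = 3;
k = -N:N;
t = 1:0.5:5;
T = 15;
a = 1:2*N+1;
y = a * exp(1j*2*pi*(k.'*t)/T);   % same y as the kron version
```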

How to vectorize MATLAB code with mvnpdf in it?

I have some working code in MATLAB, and speed is vital. I have vectorized/optimized many parts of it, and the profiler now tells me that the most time is spent in a short piece of code. In it,
I have some parameter sets for a multivariate normal distribution,
I then have to get the value of the corresponding PDF at some point pos,
and multiply it by some other value stored in a vector.
I have produced a minimal working example below:
num_params = 1000;
prob_dist_params = repmat({[1, 2], [10, 1; 1, 5]}, num_params, 1);
saved_nu = rand(num_params, 1);
saved_pos = rand(num_params, 2);
saved_total = 0;
tic()
for param_counter = 1:size(prob_dist_params, 1)
    % Evaluate the PDF at the specified point
    pdf_vals = mvnpdf(saved_pos(param_counter,:), prob_dist_params{param_counter,1}, prob_dist_params{param_counter,2});
    saved_total = saved_total + saved_nu(param_counter)*pdf_vals;
end % End of looping over parameters
toc()
I am aware that the prob_dist_params are all the same in this case, but in my code each element is different, depending on a few things upstream. I call this particular piece of code many tens of thousands of times in my full program, so I am wondering if there is anything at all I can do to vectorize this loop, or, failing that, speed it up at all? I do not know how to do so with the mvnpdf() function involved.
Yes you can. However, I don't think it will give you a huge performance boost. You will have to reshape your mu's and sigma's.
Checking the documentation of mvnpdf(X,mu,sigma), you see that you have to provide X and mu as n-by-d numeric matrices, and sigma as a d-by-d-by-n array.
In your case, d is 2 and n is 1000. You have to split the cell array into two matrices and reshape as follows:
prob_dist_mu = cell2mat(prob_dist_params(:,1));
prob_dist_sigma = cell2mat(permute(prob_dist_params(:,2),[3 2 1]));
With permute, I move the first dimension of the cell array to the third dimension, so cell2mat results in a 2-by-2-by-1000 array. Alternatively, you can define them as follows,
prob_dist_mu = repmat([1 2], [num_params 1]);
prob_dist_sigma = repmat([10, 1; 1, 5], [1 1 num_params]);
Now call mvnpdf with
pdf_vals = mvnpdf(saved_pos, prob_dist_mu, prob_dist_sigma);
saved_total = saved_nu.'*pdf_vals; % simple dot product
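Put together with the question's minimal working example, the vectorized version would look something like this (same variable names as above):

```matlab
num_params = 1000;
prob_dist_params = repmat({[1, 2], [10, 1; 1, 5]}, num_params, 1);
saved_nu  = rand(num_params, 1);
saved_pos = rand(num_params, 2);

% Reshape the cell array into the shapes mvnpdf expects:
prob_dist_mu    = cell2mat(prob_dist_params(:,1));                   % 1000-by-2
prob_dist_sigma = cell2mat(permute(prob_dist_params(:,2), [3 2 1])); % 2-by-2-by-1000

% One vectorized call replaces the loop:
pdf_vals    = mvnpdf(saved_pos, prob_dist_mu, prob_dist_sigma);      % 1000-by-1
saved_total = saved_nu.' * pdf_vals;                                 % scalar
```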

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check whether the indices meet a condition
If they do: calculate xcorr between the two rows and put it into a new vector
Find the index of the maximum value of the sub-vector and place it in the LAG matrix
I don't know how I can speed this code up, by vectorizing or otherwise.
b = size(data, 1);
F = size(data, 2);
LAG = zeros(b, b);
for i = 1:b
    for j = 1:b
        if j > i
            x = data(i,:);
            y = data(j,:);
            d = xcorr(x, y);
            d = d(:, F:(2*F)-1);
            [M, I] = max(d);
            LAG(i,j) = I-1;
            d = xcorr(y, x);
            d = d(:, F:(2*F)-1);
            [M, I] = max(d);
            LAG(j,i) = I-1;
        end
    end
end
First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, then you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and appear in indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located in index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located in index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it produces large intermediate and output variables that can exceed the maximum available array size. In such a case, for loops may be unavoidable, but you can improve upon the original for-loop solution. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
    % Loop over the upper matrix triangle:
    for col = (row+1):b
        % Cross-correlation for the upper triangle:
        d = round(xcorr(data(row, :), data(col, :)));
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(row, col) = I-1;
        % Flipped cross-correlation for the lower triangle:
        d = fliplr(d);
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(col, row) = I-1;
    end
end
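A quick way to convince yourself the two solutions agree (a sketch, assuming small random integer data of the kind described in the question):

```matlab
% Compare the no-loop solution to the improved loop solution on
% small random data with values 0-2, as in the question.
data = randi([0 2], 5, 8);
b = size(data, 1);
F = size(data, 2);

% Solution 1: no loops.
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1, :), [], 1);
LAG1 = reshape(I-1, b, b).';

% Solution 2: improved loops.
LAG2 = zeros(b, b);
for row = 1:b
    for col = (row+1):b
        d = round(xcorr(data(row, :), data(col, :)));
        [~, I] = max(d(F:(2*F)-1));
        LAG2(row, col) = I-1;
        d = fliplr(d);
        [~, I] = max(d(F:(2*F)-1));
        LAG2(col, row) = I-1;
    end
end

isequal(LAG1, LAG2)   % check that the two versions agree
```

Note that the diagonals match trivially: the zero-initialized LAG2 and the zero-lag maxima of the autocorrelations in LAG1 both give 0.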

convolution with bsxfun instead of loops in Matlab

I want to replace the for loops with bsxfun to calculate a convolution in MATLAB.
Following is the script:
for Rx = 1:Num_Rx
    for Tx = 1:Num_Tx
        Received(Rx,:) = Received(Rx,:) + conv(squeeze(channel(Rx,Tx,:))', Transmitted(Tx,:));
    end
end
% Received is a Num_Rx-by-N matrix, Transmitted is a Num_Tx-by-N matrix, and channel is a 3-D array with dimensions Num_Rx-by-Num_Tx-by-N.
When I changed the code to:
Received = bsxfun(@plus, Received, bsxfun(@conv, permute(squeeze(channel), [3 1 2]), Transmitted));
an error came out, which said "two non-single-dimension of input arrays must be matched".
How can I correct this line? Thanks a lot!
Why do you want to replace the loops with bsxfun? If the sizes involved in the convolution aren't particularly small, then the convolution itself is going to take up most of your runtime, and the difference between loops and some vectorized version of this call is going to be minimal.
One option you have, if you can afford the temporary storage and it doesn't mess with your numerics too much, is to use the FFT to do this convolution instead. That would look something like
Transmitted = reshape(Transmitted, [1 Num_Tx size(Transmitted, 2)]);
N = size(Transmitted, 3) + size(channel, 3) - 1;
% The .* relies on implicit expansion (R2016b+); use bsxfun(@times, ...) on older versions.
Received = ifft(fft(channel, N, 3).*fft(Transmitted, N, 3), N, 3);
Received = squeeze(sum(Received, 2));
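A sketch of how you might check the FFT route against the original loop on small random data (all sizes here are made up for the test; bsxfun(@times, ...) is used so the check also runs on pre-R2016b versions):

```matlab
Num_Rx = 2; Num_Tx = 3; L = 8; Lc = 4;   % arbitrary small test sizes
channel = randn(Num_Rx, Num_Tx, Lc);
Transmitted = randn(Num_Tx, L);

% Original loop (full convolution length is L + Lc - 1):
Received_loop = zeros(Num_Rx, L + Lc - 1);
for Rx = 1:Num_Rx
    for Tx = 1:Num_Tx
        Received_loop(Rx,:) = Received_loop(Rx,:) + ...
            conv(squeeze(channel(Rx,Tx,:))', Transmitted(Tx,:));
    end
end

% FFT version:
T3 = reshape(Transmitted, [1 Num_Tx L]);
N = L + Lc - 1;
R = ifft(bsxfun(@times, fft(channel, N, 3), fft(T3, N, 3)), N, 3);
Received_fft = real(squeeze(sum(R, 2)));   % imaginary parts are roundoff

max(abs(Received_loop(:) - Received_fft(:)))   % should be tiny (roundoff only)
```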

z score with nan values in matlab (vectorized)

I am trying to calculate the z-score for a vector of 5000 rows which has many NaN values. I have to calculate this many times, so I don't want to use a loop; I was hoping to find a vectorized solution.
the loop solution:
for i = 1:length(val)
    vec(i,1) = (val(i,1) - nanmean(val(:,1))) / nanstd(val(:,1));
end
A partially vectorized solution:
zscore(vec(~isnan(vec)))
but this returns a vector the length of the original minus the NaN values, so it isn't the same size as the original.
I want to calculate the z-score for the vector and then interpolate the missing data afterwards. I have to do this hundreds of times, so I am looking for a fast vectorized approach.
This is a vectorized solution:
% generate some example data with NaNs.
val = reshape(magic(4), 16, 1);
val(10) = NaN;
val(17) = NaN;
Here's the code:
valWithoutNaNs = val(~isnan(val));
valMean = mean(valWithoutNaNs);
valSD = std(valWithoutNaNs);
valZscore = (val-valMean)/valSD;
Then the column vector valZscore contains the deviations (z-scores), with NaN entries wherever val, the original measurement data, has NaN values.
Sorry this answer is 6 months late, but for anyone else who comes across this thread:
The accepted answer isn't fully vectorized in that it doesn't do what the real zscore does so beautifully: that is, compute z-scores along a particular dimension of a matrix.
If you want to calculate zscores of a large number of vectors at once, as the OP says he is doing, the best solution is this:
Z = bsxfun(@rdivide, bsxfun(@minus, X, nanmean(X)), nanstd(X));
To do it along an arbitrary dimension, just pass the dimension to nanmean and nanstd, and bsxfun takes care of the rest.
nanzscore = @(X,DIM) bsxfun(@rdivide, bsxfun(@minus, X, nanmean(X,DIM)), ...
    nanstd(X,0,DIM));
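A quick usage sketch of the dimension-aware version (the matrix here is made up for illustration):

```matlab
% z-score each column (DIM = 1) of a matrix containing NaNs.
% nanstd's second argument is the normalization flag (0 = divide by n-1,
% matching zscore's default); the third is the dimension.
nanzscore = @(X,DIM) bsxfun(@rdivide, bsxfun(@minus, X, nanmean(X,DIM)), ...
    nanstd(X,0,DIM));

X = [1 10; 2 NaN; 3 30; NaN 50];
Z = nanzscore(X, 1);   % NaNs stay NaN; each column has nanmean ~0, nanstd ~1
```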
Anonymous function:
nanZ = @(xIn) (xIn - nanmean(xIn)) / nanstd(xIn);
nanZ(vectorWithNans)
Vectorized version of the above anonymous function (assumes observations are in rows, variables in columns):
nanZ = @(xIn) (xIn - repmat(nanmean(xIn), size(xIn,1), 1)) ./ repmat(nanstd(xIn), size(xIn,1), 1);
nanZ(matrixWithNans)