Mutual Information of MATLAB Matrix - matlab

I have a square matrix that represents the frequency counts of co-occurrences in a data set. In other words, the rows represent all possible observations of feature 1, and the columns are the possible observations of feature 2. The number in cell (x, y) is the number of times feature 1 was observed to be x at the same time feature 2 was y.
I want to calculate the mutual information contained in this matrix. MATLAB has a built-in information function, but it takes 2 arguments, one for x and one for y. How would I manipulate this matrix to get the arguments it expects?
Alternatively, I wrote my own mutual information function that takes a matrix, but I'm unsure about its accuracy. Does it look right?
function [mutualinfo] = mutualInformation(counts)
total = sum(counts(:));
pX = sum(counts, 1) ./ total;
pY = sum(counts) ./ total;
pXY = counts ./ total;
[h, w] = size(counts);
mutualinfo = 0;
for row = 1:h
for col = 1:w
mutualinfo = mutualinfo + pXY(row, col) * log(pXY(row, col) / (pX(row)*pY(col)));
end;
end;
end

I don't know of any built-in mutual information functions in MATLAB. Perhaps you got a hold of one of the submissions from the MathWorks File Exchange or some other third-party developer code?
I think there may be something wrong with how you are computing pX and pY. Plus, you can vectorize your operations instead of using for loops. Here's another version of your function to try out:
function mutualInfo = mutualInformation(counts)
pXY = counts./sum(counts(:));
pX = sum(pXY,2);
pY = sum(pXY,1);
mutualInfo = pXY.*log(pXY./(pX*pY));
mutualInfo = sum(mutualInfo(:));
end

Related

Which Bins are occupied in a 3D histogram in MatLab

I got 3D data, from which I need to calculate properties.
To reduce computung I wanted to discretize the space and calculate the properties from the Bin instead of the individual data points and then reasign the propertie caclulated from the bin back to the datapoint.
I further only want to calculate the Bins which have points within them.
Since there is no 3D-binning function in MatLab, what i do is using histcounts over each dimension and then searching for the unique Bins that have been asigned to the data points.
a5pre=compositions(:,1);
a7pre=compositions(:,2);
a8pre=compositions(:,3);
%% BINNING
a5pre_edges=[0,linspace(0.005,0.995,19),1];
a5pre_val=(a5pre_edges(1:end-1) + a5pre_edges(2:end))/2;
a5pre_val(1)=0;
a5pre_val(end)=1;
a7pre_edges=[0,linspace(0.005,0.995,49),1];
a7pre_val=(a7pre_edges(1:end-1) + a7pre_edges(2:end))/2;
a7pre_val(1)=0;
a7pre_val(end)=1;
a8pre_edges=a7pre_edges;
a8pre_val=a7pre_val;
[~,~,bin1]=histcounts(a5pre,a5pre_edges);
[~,~,bin2]=histcounts(a7pre,a7pre_edges);
[~,~,bin3]=histcounts(a8pre,a8pre_edges);
bins=[bin1,bin2,bin3];
[A,~,C]=unique(bins,'rows','stable');
a5pre=a5pre_val(A(:,1));
a7pre=a7pre_val(A(:,2));
a8pre=a8pre_val(A(:,3));
It seems like that the unique function is pretty time consuming, so I was wondering if there is a faster way to do it, knowing that the line only can contain integer or so... or a totaly different.
Best regards
function [comps,C]=compo_binner(x,y,z,e1,e2,e3,v1,v2,v3)
C=NaN(length(x),1);
comps=NaN(length(x),3);
id=1;
for i=1:numel(x)
B_temp(1,1)=v1(sum(x(i)>e1));
B_temp(1,2)=v2(sum(y(i)>e2));
B_temp(1,3)=v3(sum(z(i)>e3));
C_id=sum(ismember(comps,B_temp),2)==3;
if sum(C_id)>0
C(i)=find(C_id);
else
comps(id,:)=B_temp;
id=id+1;
C_id=sum(ismember(comps,B_temp),2)==3;
C(i)=find(C_id>0);
end
end
comps(any(isnan(comps), 2), :) = [];
end
But its way slower than the histcount, unique version. Cant avoid find-function, and thats a function you sure want to avoid in a loop when its about speed...
If I understand correctly you want to compute a 3D histogram. If there's no built-in tool to compute one, it is simple to write one:
function [H, lindices] = histogram3d(data, n)
% histogram3d 3D histogram
% H = histogram3d(data, n) computes a 3D histogram from (x,y,z) values
% in the Nx3 array `data`. `n` is the number of bins between 0 and 1.
% It is assumed all values in `data` are between 0 and 1.
assert(size(data,2) == 3, 'data must be Nx3');
H = zeros(n, n, n);
indices = floor(data * n) + 1;
indices(indices > n) = n;
lindices = sub2ind(size(H), indices(:,1), indices(:,2), indices(:,3));
for ii = 1:size(data,1)
H(lindices(ii)) = H(lindices(ii)) + 1;
end
end
Now, given your compositions array, and binning each dimension into 20 bins, we get:
[H, indices] = histogram3d(compositions, 20);
idx = find(H);
[x,y,z] = ind2sub(size(H), idx);
reduced_compositions = ([x,y,z] - 0.5) / 20;
The bin centers for H are at ((1:20)-0.5)/20.
On my machine this runs in a fraction of a second for 5 million inputs points.
Now, for each composition(ii,:), you have a number indices(ii), which matches with another number idx[jj], corresponding to reduced_compositions(jj,:). One easy way to make the assignment of results is as follows:
H(H > 0) = 1:numel(idx);
indices = H(indices);
Now for each composition(ii,:), your closest match in the reduced set is reduced_compositions(indices(ii),:).

Metropolis-Hastings in matlab

I am trying to use the Metropolis Hastings algorithm with a random walk sampler to simulate samples from a function $$ in matlab, but something is wrong with my code. The proposal density is the uniform PDF on the ellipse 2s^2 + 3t^2 ≤ 1/4. Can I use the acceptance rejection method to sample from the proposal density?
N=5000;
alpha = #(x1,x2,y1,y2) (min(1,f(y1,y2)/f(x1,x2)));
X = zeros(2,N);
accept = false;
n = 0;
while n < 5000
accept = false;
while ~accept
s = 1-rand*(2);
t = 1-rand*(2);
val = 2*s^2 + 3*t^2;
% check acceptance
accept = val <= 1/4;
end
% and then draw uniformly distributed points checking that u< alpha?
u = rand();
c = u < alpha(X(1,i-1),X(2,i-1),X(1,i-1)+s,X(2,i-1)+t);
X(1,i) = c*s + X(1,i-1);
X(2,i) = c*t + X(2,i-1);
n = n+1;
end
figure;
plot(X(1,:), X(2,:), 'r+');
You may just want to use the native implementation of matlab mhsample.
Regarding your code, there are a few things missing:
- function alpha,
- loop variable i (it might be just n but it is not suited for indexing since it starts at zero).
And you should always allocate memory in matlab if you want to fill it dynamically, i.e. X in your case.
To expand on the suggestions by #max, the code appears to work if you change the i indices to n and replace
n = 0;
with
n = 2;
X(:,1) = [.1,.1];
It would probably be better to assign X(:,1) to random values within your accept region (using the same code you use later), and/or include a burn-in period.
Depending upon what you are going to do with this, it may also make things cleaner to evaluate the argument to sin in the f function to keep it within 0 to 2 pi (likely by shifting the value by 2 pi if it exceeds those bounds)

How to make a graph from function output in matlab

I'm completely lost at this using MATLAB functions, so here is the case:
lets assume I have SUM=0, and
I have a constant probability P that the user gives me, and I have to compare this constant P, with other M (also user gives M) random probabilities, if P is larger I add 1 to SUM, if P is smaller I add -1 to SUM... and at the end I want print on the screen the graph of the process.
I managed till now to make only one stage with this code:
function [result] = ex1(p)
if (rand>=p) result=1;
else result=-1;
end
(its like M=1)
How do You suggest I can modify this code in order to make it work the way I described it before (including getting a graph) ?
Or maybe I'm getting the logic wrong? the question says I get 1 with probability P, and -1 with probability (1-P), and the SUM is the same
Many thanks
I'm not sure how you achieve your input, but this should get you on the way:
p = 0.5; % Constant probability
m = 10;
randoms = rand(m,1) % Random probabilities
results = ones(m,1);
idx = find(randoms < p)
results(idx) = -1;
plot(cumsum(results))
For m = 1000:
You can do it like this:
p = 0.25; % example data
M = 20; % example data
random = rand(M,1); % generate values
y = cumsum(2*(random>=p)-1); % compute cumulative sum of +1/-1
plot(y) % do the plot
The important function here is cumsum, which does the cumulative sum on the sequence of +1/-1 values generated by 2*(random>=p)-1.
Example graph with p=0.5, M=2000:

How can I plot data to a “best fit” cos² graph in Matlab?

I’m currently a Physics student and for several weeks have been compiling data related to ‘Quantum Entanglement’. I’ve now got to a point where I have to plot my data (which should resemble a cos² graph - and does) to a sort of “best fit” cos² graph. The lab script says the following:
A more precise determination of the visibility V (this is basically how 'clean' the data is) follows from the best fit to the measured data using the function:
f(b) = A/2[1-Vsin(b-b(center)/P)]
Granted this probably doesn’t mean much out of context, but essentially A is the amplitude, b is an angle and P is the periodicity. Hence this is also a “wave” like the experimental data I have found.
From this I understand, as previously mentioned, I am making a “best fit” curve. However, I have been told that this isn’t possible with Excel and that the best approach is Matlab.
I know intermediate JavaScript but do not know Matlab and was hoping for some direction.
Is there a tutorial I can read for this? Is it possible for someone to go through it with me? I really have no idea what it entails, so any feed back would be greatly appreciated.
Thanks a lot!
Initial steps
I guess we should begin by getting a representation in Matlab of the function that you're trying to model. A direct translation of your formula looks like this:
function y = targetfunction(A,V,P,bc,b)
y = (A/2) * (1 - V * sin((b-bc) / P));
end
Getting hold of the data
My next step is going to be to generate some data to work with (you'll use your own data, naturally). So here's a function that generates some noisy data. Notice that I've supplied some values for the parameters.
function [y b] = generateData(npoints,noise)
A = 2;
V = 1;
P = 0.7;
bc = 0;
b = 2 * pi * rand(npoints,1);
y = targetfunction(A,V,P,bc,b) + noise * randn(npoints,1);
end
The function rand generates random points on the interval [0,1], and I multiplied those by 2*pi to get points randomly on the interval [0, 2*pi]. I then applied the target function at those points, and added a bit of noise (the function randn generates normally distributed random variables).
Fitting parameters
The most complicated function is the one that fits a model to your data. For this I use the function fminunc, which does unconstrained minimization. The routine looks like this:
function [A V P bc] = bestfit(y,b)
x0(1) = 1; %# A
x0(2) = 1; %# V
x0(3) = 0.5; %# P
x0(4) = 0; %# bc
f = #(x) norm(y - targetfunction(x(1),x(2),x(3),x(4),b));
x = fminunc(f,x0);
A = x(1);
V = x(2);
P = x(3);
bc = x(4);
end
Let's go through line by line. First, I define the function f that I want to minimize. This isn't too hard. To minimize a function in Matlab, it needs to take a single vector as a parameter. Therefore we have to pack our four parameters into a vector, which I do in the first four lines. I used values that are close, but not the same, as the ones that I used to generate the data.
Then I define the function I want to minimize. It takes a single argument x, which it unpacks and feeds to the targetfunction, along with the points b in our dataset. Hopefully these are close to y. We measure how far they are from y by subtracting from y and applying the function norm, which squares every component, adds them up and takes the square root (i.e. it computes the root mean square error).
Then I call fminunc with our function to be minimized, and the initial guess for the parameters. This uses an internal routine to find the closest match for each of the parameters, and returns them in the vector x.
Finally, I unpack the parameters from the vector x.
Putting it all together
We now have all the components we need, so we just want one final function to tie them together. Here it is:
function master
%# Generate some data (you should read in your own data here)
[f b] = generateData(1000,1);
%# Find the best fitting parameters
[A V P bc] = bestfit(f,b);
%# Print them to the screen
fprintf('A = %f\n',A)
fprintf('V = %f\n',V)
fprintf('P = %f\n',P)
fprintf('bc = %f\n',bc)
%# Make plots of the data and the function we have fitted
plot(b,f,'.');
hold on
plot(sort(b),targetfunction(A,V,P,bc,sort(b)),'r','LineWidth',2)
end
If I run this function, I see this being printed to the screen:
>> master
Local minimum found.
Optimization completed because the size of the gradient is less than
the default value of the function tolerance.
A = 1.991727
V = 0.979819
P = 0.695265
bc = 0.067431
And the following plot appears:
That fit looks good enough to me. Let me know if you have any questions about anything I've done here.
I am a bit surprised as you mention f(a) and your function does not contain an a, but in general, suppose you want to plot f(x) = cos(x)^2
First determine for which values of x you want to make a plot, for example
xmin = 0;
stepsize = 1/100;
xmax = 6.5;
x = xmin:stepsize:xmax;
y = cos(x).^2;
plot(x,y)
However, note that this approach works just as well in excel, you just have to do some work to get your x values and function in the right cells.

How can I speed up this call to quantile in Matlab?

I have a MATLAB routine with one rather obvious bottleneck. I've profiled the function, with the result that 2/3 of the computing time is used in the function levels:
The function levels takes a matrix of floats and splits each column into nLevels buckets, returning a matrix of the same size as the input, with each entry replaced by the number of the bucket it falls into.
To do this I use the quantile function to get the bucket limits, and a loop to assign the entries to buckets. Here's my implementation:
function [Y q] = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q)
q=transpose(q);
end
Y = zeros(size(X));
for i = 1:nLevels
% "The variables g and l indicate the entries that are respectively greater than
% or less than the relevant bucket limits. The line Y(g & l) = i is assigning the
% value i to any element that falls in this bucket."
if i ~= nLevels % "The default; doesnt include upper bound"
g = bsxfun(#ge,X,q(i,:));
l = bsxfun(#lt,X,q(i+1,:));
else % "For the final level we include the upper bound"
g = bsxfun(#ge,X,q(i,:));
l = bsxfun(#le,X,q(i+1,:));
end
Y(g & l) = i;
end
Is there anything I can do to speed this up? Can the code be vectorized?
If I understand correctly, you want to know how many items fell in each bucket.
Use:
n = hist(Y,nbins)
Though I am not sure that it will help in the speedup. It is just cleaner this way.
Edit : Following the comment:
You can use the second output parameter of histc
[n,bin] = histc(...) also returns an index matrix bin. If x is a vector, n(k) = >sum(bin==k). bin is zero for out of range values. If x is an M-by-N matrix, then
How About this
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
Y = zeros(size(X));
for i = 1:numel(q)-1
Y = Y+ X>=q(i);
end
This results in the following:
>>X = [3 1 4 6 7 2];
>>[Y, q] = levels(X,2)
Y =
1 1 2 2 2 1
q =
1 3.5 7
You could also modify the logic line to ensure values are less than the start of the next bin. However, I don't think it is necessary.
I think you shoud use histc
[~,Y] = histc(X,q)
As you can see in matlab's doc:
Description
n = histc(x,edges) counts the number of values in vector x that fall
between the elements in the edges vector (which must contain
monotonically nondecreasing values). n is a length(edges) vector
containing these counts. No elements of x can be complex.
I made a couple of refinements (including one inspired by Aero Engy in another answer) that have resulted in some improvements. To test them out, I created a random matrix of a million rows and 100 columns to run the improved functions on:
>> x = randn(1000000,100);
First, I ran my unmodified code, with the following results:
Note that of the 40 seconds, around 14 of them are spent computing the quantiles - I can't expect to improve this part of the routine (I assume that Mathworks have already optimized it, though I guess that to assume makes an...)
Next, I modified the routine to the following, which should be faster and has the advantage of being fewer lines as well!
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
Y = ones(size(X));
for i = 2:nLevels
Y = Y + bsxfun(#ge,X,q(i,:));
end
The profiling results with this code are:
So it is 15 seconds faster, which represents a 150% speedup of the portion of code that is mine, rather than MathWorks.
Finally, following a suggestion of Andrey (again in another answer) I modified the code to use the second output of the histc function, which assigns entries to bins. It doesn't treat the columns independently, so I had to loop over the columns manually, but it seems to be performing really well. Here's the code:
function [Y q] = levels(X,nLevels)
p = linspace(0,1,nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
q(end,:) = 2 * q(end,:);
Y = zeros(size(X));
for k = 1:size(X,2)
[junk Y(:,k)] = histc(X(:,k),q(:,k));
end
And the profiling results:
We now spend only 4.3 seconds in codes outside the quantile function, which is around a 500% speedup over what I wrote originally. I've spent a bit of time writing this answer because I think it's turned into a nice example of how you can use the MATLAB profiler and StackExchange in combination to get much better performance from your code.
I'm happy with this result, although of course I'll continue to be pleased to hear other answers. At this stage the main performance increase will come from increasing the performance of the part of the code that currently calls quantile. I can't see how to do this immediately, but maybe someone else here can. Thanks again!
You can sort the columns and divide+round the inverse indexes:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
[S,IX]=sort(X);
[grid1,grid2]=ndgrid(1:size(IX,1),1:size(IX,2));
invIX=zeros(size(X));
invIX(sub2ind(size(X),IX(:),grid2(:)))=grid1;
Y=ceil(invIX/size(X,1)*nLevels);
Or you can use tiedrank:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
R=tiedrank(X);
Y=ceil(R/size(X,1)*nLevels);
Surprisingly, both these solutions are slightly slower than the quantile+histc solution.