Which bins are occupied in a 3D histogram in MATLAB

I have 3D data from which I need to calculate properties.
To reduce computation I want to discretize the space and calculate the properties per bin instead of per data point, and then reassign the property calculated from each bin back to its data points.
Furthermore, I only want to calculate bins that actually contain points.
Since there is no 3D binning function in MATLAB, what I do is use histcounts over each dimension and then search for the unique bins that have been assigned to the data points.
a5pre=compositions(:,1);
a7pre=compositions(:,2);
a8pre=compositions(:,3);
%% BINNING
a5pre_edges=[0,linspace(0.005,0.995,19),1];
a5pre_val=(a5pre_edges(1:end-1) + a5pre_edges(2:end))/2;
a5pre_val(1)=0;
a5pre_val(end)=1;
a7pre_edges=[0,linspace(0.005,0.995,49),1];
a7pre_val=(a7pre_edges(1:end-1) + a7pre_edges(2:end))/2;
a7pre_val(1)=0;
a7pre_val(end)=1;
a8pre_edges=a7pre_edges;
a8pre_val=a7pre_val;
[~,~,bin1]=histcounts(a5pre,a5pre_edges);
[~,~,bin2]=histcounts(a7pre,a7pre_edges);
[~,~,bin3]=histcounts(a8pre,a8pre_edges);
bins=[bin1,bin2,bin3];
[A,~,C]=unique(bins,'rows','stable');
a5pre=a5pre_val(A(:,1));
a7pre=a7pre_val(A(:,2));
a8pre=a8pre_val(A(:,3));
It seems that the unique function is pretty time consuming, so I was wondering if there is a faster way to do it, knowing that the rows can only contain integers... or a totally different approach.
Best regards

function [comps,C]=compo_binner(x,y,z,e1,e2,e3,v1,v2,v3)
C=NaN(length(x),1);
comps=NaN(length(x),3);
id=1;
for i=1:numel(x)
    % look up the bin-center value along each dimension
    B_temp(1,1)=v1(sum(x(i)>e1));
    B_temp(1,2)=v2(sum(y(i)>e2));
    B_temp(1,3)=v3(sum(z(i)>e3));
    % has this bin been seen before?
    C_id=sum(ismember(comps,B_temp),2)==3;
    if sum(C_id)>0
        C(i)=find(C_id);
    else
        % new bin: record it, then look up its index
        comps(id,:)=B_temp;
        id=id+1;
        C_id=sum(ismember(comps,B_temp),2)==3;
        C(i)=find(C_id>0);
    end
end
comps(any(isnan(comps), 2), :) = [];
end
But it's way slower than the histcounts/unique version. I can't avoid the find function, and that's a function you definitely want to avoid in a loop when speed matters...

If I understand correctly, you want to compute a 3D histogram. Since there's no built-in tool to compute one, it is simple enough to write one:
function [H, lindices] = histogram3d(data, n)
% histogram3d 3D histogram
%   H = histogram3d(data, n) computes a 3D histogram from (x,y,z) values
%   in the Nx3 array `data`. `n` is the number of bins between 0 and 1.
%   It is assumed all values in `data` are between 0 and 1.
assert(size(data,2) == 3, 'data must be Nx3');
H = zeros(n, n, n);
indices = floor(data * n) + 1;   % per-dimension bin index
indices(indices > n) = n;        % clamp values exactly equal to 1 into the top bin
lindices = sub2ind(size(H), indices(:,1), indices(:,2), indices(:,3));
for ii = 1:size(data,1)
    H(lindices(ii)) = H(lindices(ii)) + 1;
end
end
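If the counting loop ever shows up as a bottleneck, accumarray can replace it. A minimal sketch under the same assumptions (this is an alternative, not part of the original answer):
% Loop-free equivalent of the counting loop: one count per linear bin index
H = reshape(accumarray(lindices, 1, [n^3, 1]), n, n, n);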
Now, given your compositions array, and binning each dimension into 20 bins, we get:
[H, indices] = histogram3d(compositions, 20);
idx = find(H);
[x,y,z] = ind2sub(size(H), idx);
reduced_compositions = ([x,y,z] - 0.5) / 20;
The bin centers for H are at ((1:20)-0.5)/20.
On my machine this runs in a fraction of a second for 5 million input points.
Now, for each composition(ii,:), you have a number indices(ii), which matches another number idx(jj), corresponding to reduced_compositions(jj,:). One easy way to make the assignment of results is as follows:
H(H > 0) = 1:numel(idx);
indices = H(indices);
Now for each composition(ii,:), your closest match in the reduced set is reduced_compositions(indices(ii),:).
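For example, to push a per-bin property back onto the original data points (the goal stated in the question), something along these lines should work; prop_per_bin is a hypothetical vector holding one property value per row of reduced_compositions:
% prop_per_bin: hypothetical, one value per occupied bin
prop_per_point = prop_per_bin(indices);   % scatter bin properties back to all points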


Verify Law of Large Numbers in MATLAB

The problem:
If a large number of fair N-sided dice are rolled, the average of the simulated rolls is likely to be close to the mean of 1, 2, ..., N, i.e. the expected value of one die. For example, the expected value of a 6-sided die is 3.5.
Given N, simulate 1e8 N-sided dice rolls by creating a vector of 1e8 uniformly distributed random integers. Return the difference between the mean of this vector and the mean of integers from 1 to N.
My code:
function dice_diff = loln(N)
% the mean of integer from 1 to N
A = 1:N
meanN = sum(A)/N;
% I do not have any idea what I am doing here!
V = randi(1e8);
meanvector = V/1e8;
dice_diff = meanvector - meanN;
end
First of all, make sure every time you ask a question that it is as clear as possible, to make it easier for other users to read.
If you check how randi works, you can see this:
R = randi(IMAX,N) returns an N-by-N matrix containing pseudorandom
integer values drawn from the discrete uniform distribution on 1:IMAX.
randi(IMAX,M,N) or randi(IMAX,[M,N]) returns an M-by-N matrix.
randi(IMAX,M,N,P,...) or randi(IMAX,[M,N,P,...]) returns an
M-by-N-by-P-by-... array. randi(IMAX) returns a scalar.
randi(IMAX,SIZE(A)) returns an array the same size as A.
So, if you want to use randi in your problem, you have to use it like this:
V=randi(N, 1e8,1);
and you need some more changes:
function dice_diff = loln(N)
%the mean of integer from 1 to N
A = 1:N;
meanN = mean(A);
V = randi(N, 1e8,1);
meanvector = mean(V);
dice_diff = meanvector - meanN;
end
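A quick sanity check (a sketch; the exact value varies from run to run since the rolls are random):
dice_diff = loln(6)   % should be close to 0, typically on the order of 1e-4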
For future problems, try using the command
help randi
and MATLAB will explain how randi (or any other function) works.
Make sure to check that the code above gives the desired result.
As pointed out, take a closer look at the use of randi(). From the general case
X = randi([LowerInt,UpperInt],NumRows,NumColumns); % UpperInt > LowerInt
you can adapt to dice rolling by
Rolls = randi([1 NumSides],NumRolls,NumSamplePaths);
as an example. Exchanging NumRolls and NumSamplePaths will yield Rolls.', i.e. transpose(Rolls).
According to the Law of Large Numbers, the running sample average after each roll should converge to the true mean, ExpVal (short for expected value), as the number of rolls (trials) increases. The plots produced by the code below show this for two sample paths.
To get the sample mean for each number of dice rolls, I used arrayfun() with
CumulativeAvg1 = arrayfun(@(jj) mean(Rolls(1:jj,1)), 1:NumRolls);
which is equivalent to using the cumulative sum, cumsum(), to get the same result.
CumulativeAvg1 = (cumsum(Rolls(:,1))./(1:NumRolls).'); % equivalent
% MATLAB R2019a
% Create Dice
NumSides = 6; % positive nonzero integer
NumRolls = 200;
NumSamplePaths = 2;
% Roll Dice
Rolls = randi([1 NumSides],NumRolls,NumSamplePaths);
% Output Statistics
ExpVal = mean(1:NumSides);
CumulativeAvg1 = arrayfun(@(jj) mean(Rolls(1:jj,1)), 1:NumRolls);
CumulativeAvgError1 = CumulativeAvg1 - ExpVal;
CumulativeAvg2 = arrayfun(@(jj) mean(Rolls(1:jj,2)), 1:NumRolls);
CumulativeAvgError2 = CumulativeAvg2 - ExpVal;
% Plot
figure
subplot(2,1,1), hold on, box on
plot(1:NumRolls,CumulativeAvg1,'b--','LineWidth',1.5,'DisplayName','Sample Path 1')
plot(1:NumRolls,CumulativeAvg2,'r--','LineWidth',1.5,'DisplayName','Sample Path 2')
yline(ExpVal,'k-')
title('Average')
xlabel('Number of Trials')
ylim([1 NumSides])
subplot(2,1,2), hold on, box on
plot(1:NumRolls,CumulativeAvgError1,'b--','LineWidth',1.5,'DisplayName','Sample Path 1')
plot(1:NumRolls,CumulativeAvgError2,'r--','LineWidth',1.5,'DisplayName','Sample Path 2')
yline(0,'k-')
title('Error')
xlabel('Number of Trials')

Vectorizing multiple gaussian calculations

I am attempting to vectorize some of my code that adds the intensities of many Gaussian distributions over an image. I currently loop over the function gaussIt2D for each Gaussian; the function itself is vectorized for a single 2D Gaussian:
windowSize=10;
imSize=[512,512];
%pointsR is an nx2 array of coordinates [x1,y1;x2,y2;...;xn,yn]
pointsR=rand(100,2)*511+1;
%sigmaR is the standard deviation of the gaussian being created
sigmaR = 1;
outputImage=zeros(imSize);
for n=1:size(pointsR,1)
    rangeX = floor(pointsR(n,1)-windowSize):ceil(pointsR(n,1)+windowSize);
    rangeX = rangeX(rangeX > 0 & rangeX <= imSize(1));
    rangeY = floor(pointsR(n,2)-windowSize):ceil(pointsR(n,2)+windowSize);
    rangeY = rangeY(rangeY > 0 & rangeY <= imSize(2));
    outputImage(rangeX,rangeY) = outputImage(rangeX,rangeY) + gaussIt2D(rangeX(1),rangeX(end),rangeY(1),rangeY(end),sigmaR,pointsR(n,1),pointsR(n,2));
end
function [result] = gaussIt2D(xInit,xFinal,yInit,yFinal,sigma,xCenter,yCenter)
%Returns gaussian intensity values for the region defined by [xInit:xFinal,yInit:yFinal] using the gaussian properties sigma,centerX,centerY
[gridX,gridY] = ndgrid(xInit:xFinal, yInit:yFinal);
result = exp( -( (gridX-xCenter).^2 + (gridY-yCenter).^2 ) ./ (2*sigma.^2) );
end
I am trying to further vectorize this process by allowing the gaussIt2D function to accept vectors of x and y values and vectors of x and y centers, and process them all together. My thought process so far has been to stack the grids, replicate the centers, and do the element-wise Gaussian calculations. For (a simplified) example, if:
xInits = [1,2,3];
xFinals = [2,3,4];
xCenters = [1.2,2.8,3.1];
yInits = [1,2,3];
yFinals = [2,3,4];
yCenters = [1.5,2.4,3.6];
Then I was thinking to create grids and centers following the form:
gridX = [1,2
1,2
2,3
2,3
3,4
3,4]
xCenters = [1.2,1.2
1.2,1.2
2.8,2.8
2.8,2.8
3.1,3.1
3.1,3.1]
This could then be used in the same gaussian equation used in the original function. However, generating these arrays is tripping me up. What I have right now is:
function [result]=gaussIt2DVectorized(xInits,xFinals,yInits,yFinals,sigmas,xCenters,yCenters)
%Incomplete
%Returns gaussian intenisty values for the region defined by
%[xInit:xFinal,yInit:yFinal] using the values array:[sigma,centerX,centerY]
[gridX,gridY]=arrayfun('ndgrid',xInits:xFinals,yInits:yFinals);
xCenters = repelem(xCenters,numel(xInits(1):xFinals(1)), numel(yInits(1):yFinals(1)));
yCenters = repelem(yCenters,numel(xInits(1):xFinals(1)), numel(yInits(1):yFinals(1)));
result=exp( -( (gridX-xCenters).^2 + (gridY-yCenters).^2 ) ./ (2*sigmas^2) );
end
This doesn't actually work, though, and I also anticipate difficulty accounting for ranges (i.e. xInit:xFinal) of different lengths.
Any help, tips, or alternate methods would be appreciated.
Thanks.
Since you cannot be sure that the grids will all be the same size, it's probably best to store them in a cell array rather than stacking them in a matrix. With cell arrays, you can still run your calculation without looping, using cellfun.
For example:
function [result] = gaussIt2D_better(xInits,xFinals,yInits,yFinals,sigmas,xCenters,yCenters)
[gridsX, gridsY] = arrayfun(@(x) ndgrid(xInits(x):xFinals(x), yInits(x):yFinals(x)), 1:length(xInits), 'UniformOutput', 0);
f = @(gridX, gridY, xCenter, yCenter, sigma) exp( -( (gridX-xCenter).^2 + (gridY-yCenter).^2 ) ./ (2*sigma.^2) );
result = cellfun(f, gridsX, gridsY, num2cell(xCenters), num2cell(yCenters), num2cell(sigmas), 'UniformOutput', 0);
end
Note that in this example, the value returned is a cell array with the same length as the input vectors, one result for each.
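To rebuild the full image from these per-Gaussian patches, you can reuse the range bookkeeping from your original loop. A minimal sketch, assuming xInits/xFinals/yInits/yFinals are the already-clipped ranges:
outputImage = zeros(imSize);
for n = 1:numel(result)
    rX = xInits(n):xFinals(n);   % rows covered by patch n
    rY = yInits(n):yFinals(n);   % columns covered by patch n
    outputImage(rX, rY) = outputImage(rX, rY) + result{n};
end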

how to vectorize array reformatting?

I have a .csv file with data on each line in the format (x,y,z,t,f), where f is the value of some function at location (x,y,z) at time t. So each new line in the .csv gives a new set of coordinates (x,y,z,t) with an accompanying value f. The .csv is not sorted.
I want to use imagesc to create a video of this data in the xy-plane as time progresses. The way I've done this is by reformatting M into something more easily usable by imagesc, using three nested loops, roughly like this:
M = csvread('file.csv');
uniqueX = unique(M(:,1));
uniqueY = unique(M(:,2));
uniqueT = unique(M(:,4));
M_reformatted = zeros(length(uniqueX), length(uniqueY), length(uniqueT));
for i = 1:length(uniqueX)
    for j = 1:length(uniqueY)
        for k = 1:length(uniqueT)
            M_reformatted(i,j,k) = M( ...
                M(:,1)==uniqueX(i) & ...
                M(:,2)==uniqueY(j) & ...
                M(:,4)==uniqueT(k), ...
                5 ...
            );
        end
    end
end
Once I have M_reformatted, I can loop through the time steps k and use imagesc on M_reformatted(:,:,k). But the nested loops above are very slow. Is it possible to vectorize them? If so, an outline of the approach would be very helpful.
Edit: as noted in the answers/comments below, I made a mistake in that there are several possible z-values, which I haven't taken into account. With only a single z-value, the above would be fine.
This vectorized solution allows for negative values of x and y and is many times faster than the non-vectorized solution (close to 20x times for the test case at the bottom).
The idea is to sort the x, y, and t values in lexicographical order using sortrows and then using reshape to build the time slices of M_reformatted.
The code:
idx = find(M(:,3)==0); %// find rows where z==0
M2 = M(idx,:); %// M2 has only the rows where z==0
M2(:,3) = []; %// delete z coordinate in M2
M2(:,[1 2 3]) = M2(:,[3 1 2]); %// change from (x,y,t,f) to (t,x,y,f)
M2 = sortrows(M2); %// sort rows by t, then x, then y
numT = numel(unique(M2(:,1))); %// number of unique t values
numX = numel(unique(M2(:,2))); %// number of unique x values
numY = numel(unique(M2(:,3))); %// number of unique y values
%// fill the time slice matrix with data
M_reformatted = reshape(M2(:,4), numY, numX, numT);
Note: I am assuming y refers to the columns of the image and x refers to the rows. If you want these flipped, use M_reformatted = permute(M_reformatted,[2 1 3]) at the end of the code.
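With M_reformatted in hand, the video the question describes is just a loop over time slices. A minimal sketch (uniqueT comes from the question's own code):
for k = 1:numT
    imagesc(M_reformatted(:,:,k))
    axis image
    title(sprintf('t = %g', uniqueT(k)))
    drawnow
end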
The test case I used for M (to compare the result to other solutions) has a NxNxN space with T times slices:
N = 10;
T = 10;
[x,y,z] = meshgrid(-N:N,-N:N,-N:N);
numPoints = numel(x);
x=x(:); y=y(:); z=z(:);
s = repmat([x,y,z],T,1);
t = repmat(1:T,numPoints,1);
M = [s, t(:), rand(numPoints*T,1)];
M = M( randperm(size(M,1)), : );
I don't think you need to vectorize. I think you should change your algorithm.
You only need one loop to step through the lines of the CSV file. For every line, you have (x,y,z,t,f) so just store it in M_reformatted where it belongs. Something like this:
M_reformatted = zeros(max(M(:,1)), max(M(:,2)), max(M(:,4)));
for line = 1:size(M,1)   % loop over the rows of M
    z = M(line, 3);
    if z ~= 0, continue; end
    x = M(line, 1);
    y = M(line, 2);
    t = M(line, 4);
    f = M(line, 5);
    M_reformatted(x, y, t) = f;
end
Also note that pre-allocating M_reformatted is a very good idea, but your code may have been getting the size wrong (depending on the data). I think using max like I did will always do the right thing.
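The remaining loop can itself be removed with sub2ind, under the same assumption that x, y, and t are positive integers. A minimal sketch:
keep = (M(:,3) == 0);   % only rows with z == 0
M_reformatted = zeros(max(M(:,1)), max(M(:,2)), max(M(:,4)));
lin = sub2ind(size(M_reformatted), M(keep,1), M(keep,2), M(keep,4));
M_reformatted(lin) = M(keep,5);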

How to make a graph from function output in matlab

I'm completely lost when it comes to MATLAB functions, so here is the case:
Let's assume I have SUM = 0, and
I have a constant probability P that the user gives me. I have to compare this constant P with M other random probabilities (the user also gives M); if P is larger I add 1 to SUM, and if P is smaller I add -1 to SUM... and at the end I want to print the graph of the process on the screen.
I managed till now to make only one stage with this code:
function [result] = ex1(p)
if (rand>=p) result=1;
else result=-1;
end
(it's like M=1)
How do you suggest I modify this code to make it work the way I described (including getting a graph)?
Or maybe I'm getting the logic wrong? The question says I get 1 with probability P and -1 with probability (1-P), and SUM works the same way.
Many thanks
I'm not sure how you get your input, but this should get you on your way:
p = 0.5; % Constant probability
m = 10;
randoms = rand(m,1) % Random probabilities
results = ones(m,1);
idx = find(randoms < p)
results(idx) = -1;
plot(cumsum(results))
For m = 1000, the plot of the cumulative sum traces out the resulting random walk.
You can do it like this:
p = 0.25; % example data
M = 20; % example data
random = rand(M,1); % generate values
y = cumsum(2*(random>=p)-1); % compute cumulative sum of +1/-1
plot(y) % do the plot
The important function here is cumsum, which does the cumulative sum on the sequence of +1/-1 values generated by 2*(random>=p)-1.
An example graph with p=0.5 and M=2000 shows the resulting random walk.
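If you want to package this as a function in the spirit of your ex1, here is a minimal sketch (the name ex1_walk is hypothetical; the sign convention follows the snippets above, so flip the inequality if you want +1 with probability P):
function SUM = ex1_walk(p, M)
% Hypothetical helper: M comparisons against the constant probability p;
% +1 when rand >= p, -1 otherwise. Plots the running sum of the process.
steps = 2*(rand(M,1) >= p) - 1;
SUM = sum(steps);
plot(cumsum(steps))
xlabel('Trial'), ylabel('Running SUM')
end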

Speeding up the conditional filling of huge sparse matrices

I was wondering if there is a way of speeding up (maybe via vectorization?) the conditional filling of huge sparse matrices (e.g. ~ 1e10 x 1e10). Here's the sample code where I have a nested loop, and I fill in a sparse matrix only if a certain condition is met:
% We are given the following cell arrays of the same size:
% all_arrays_1
% all_arrays_2
% all_mapping_arrays
N = 1e10;
% The number of nnz (non-zeros) is unknown until the loop finishes
huge_sparse_matrix = sparse([],[],[],N,N);
n_iterations = numel(all_arrays_1);
for iteration=1:n_iterations
    array_1 = all_arrays_1{iteration};
    array_2 = all_arrays_2{iteration};
    mapping_array = all_mapping_arrays{iteration};
    n_elements_in_array_1 = numel(array_1);
    n_elements_in_array_2 = numel(array_2);
    for element_1 = 1:n_elements_in_array_1
        element_2 = mapping_array(element_1);
        % Sanity check:
        if element_2 <= n_elements_in_array_2
            item_1 = array_1(element_1);
            item_2 = array_2(element_2);
            huge_sparse_matrix(item_1,item_2) = 1;
        end
    end
end
I am struggling to vectorize the nested loop. As far as I understand, filling a sparse matrix element by element is very slow when the number of entries is large (~100M). I need to work with a sparse matrix since its dimensions are in the 10,000M x 10,000M range. However, this way of filling a sparse matrix in MATLAB is very slow.
Edits:
I have updated the names of the variables to reflect their nature better. There are no function calls.
Addendum:
This code builds the adjacency matrix for a huge graph. The variable all_mapping_arrays holds mapping arrays (~ adjacency relationships) between nodes of the graph in a local representation, which is why I need array_1 and array_2 to map the adjacency to a global representation.
I think it is the incremental update of the sparse matrix, rather than the loop-based conditional, that is slowing things down.
When you add a new entry to a sparse matrix via something like A(i,j) = 1, it typically requires that the whole matrix data structure is re-packed. This is an expensive operation. If you're interested, MATLAB uses a CCS data structure (compressed column storage) internally, which is described under the Data Structure section here. Note the statement:
This scheme is not efficient for manipulating matrices one element at a time
Generally, it's far better (faster) to accumulate the non-zero entries in the matrix as a set of triplets and then make a single call to sparse. For example (warning: brain-compiled code!):
% Inputs:
% N
% prev_array and next_array
% n_labels_prev and n_labels_next
% mapping
% allocate space for matrix entries as a set of "triplets"
ii = zeros(N,1);
jj = zeros(N,1);
xx = zeros(N,1);
nn = 0;
for next_label_ix = 1:n_labels_next
    prev_label = mapping(next_label_ix);
    if prev_label <= n_labels_prev
        prev_global_label = prev_array(prev_label);
        next_global_label = next_array(next_label_ix);
        % reallocate triplets on demand
        if (nn + 1 > length(ii))
            ii = [ii; zeros(N,1)];
            jj = [jj; zeros(N,1)];
            xx = [xx; zeros(N,1)];
        end
        % append a new triplet and increment counter
        ii(nn + 1) = next_global_label; % row index
        jj(nn + 1) = prev_global_label; % col index
        xx(nn + 1) = 1.0; % coefficient
        nn = nn + 1;
    end
end
% we may have over-allocated our triplets, so trim the arrays
% based on our final counter
ii = ii(1:nn);
jj = jj(1:nn);
xx = xx(1:nn);
% just make a single call to "sparse" to pack the triplet data
% as a sparse matrix object
sp_graph_adj_global = sparse(ii,jj,xx,N,N);
I'm allocating in chunks of N entries at a time. Assuming that you know a lot about the structure of your matrix, you might be able to use a better value here.
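Combining this triplet approach with a vectorized version of the question's inner loop removes the per-element work entirely. A minimal sketch against the question's variable names, assuming the cell arrays hold numeric vectors:
rows = cell(n_iterations, 1);
cols = cell(n_iterations, 1);
for iteration = 1:n_iterations
    array_1 = all_arrays_1{iteration}(:);
    array_2 = all_arrays_2{iteration}(:);
    mapping_array = all_mapping_arrays{iteration}(:);
    valid = mapping_array <= numel(array_2);   % the sanity check, vectorized
    rows{iteration} = array_1(valid);
    cols{iteration} = array_2(mapping_array(valid));
end
% sparse() sums duplicate (i,j) triplets, so force every entry back to 1
huge_sparse_matrix = spones(sparse(vertcat(rows{:}), vertcat(cols{:}), 1, N, N));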
Hope this helps.