How can I speed up the following operations? The bottleneck is the third line, even though for large dimensions of A1 the fourth line is quite fast. Does the third line actually make a copy of A(b,b) that is stored in A1?
A = randn(1000,1000);
b = [67 145 200 185 11 166 80 137 163 132 133 19]; %random
A1 = A(b,b);
v=A1*A(2,b)';
Note that the following is just as slow so I just broke up that line into two parts to demonstrate that the third line above is the bottleneck.
v=A(b,b)*A(2,b)';
See if this makes it faster as a replacement for the third line that you claim to be the bottleneck -
[x,y]= ndgrid(b,b);
A1 = A((y-1)*size(A,1)+x);
Or
A1 = A(bsxfun(@plus,(b-1)*size(A,1),b'));
Edit: After profiling the code listed above, the runtime doesn't look to have improved by much. You mentioned in the comments that you are using this code in a loop many times and that b varies between iterations. If the loop count were small and b were constant across iterations, one could have batched all those matrix-vector multiplications into one big matrix-matrix multiplication, but that isn't the case here. So, at this point, I would say the indexing bottleneck might have to stay.
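For illustration only, here is a minimal sketch of that batching idea, under the assumption (contrary to the actual use case) that b stays fixed across iterations and only the row index changes; the indices in rows are hypothetical:
% Sketch: batch several matrix-vector products into one matrix-matrix multiply.
% Assumes b is constant across iterations, which is NOT the case in the question.
rows = [2 5 9];          % hypothetical row indices used across iterations
A1 = A(b,b);             % index once
V = A1 * A(rows,b)';     % column k of V equals the v from iteration k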
Related
How can I access different rows from multiple pages in a 3D array while avoiding a for-loop?
Let's assume I have a 10x5x3 matrix (mat1) and I would like to copy different individual rows from the three pages (such as the 4th, 2nd, and 5th row of the 1st, 2nd, and 3rd page) into the first row of another 10x5x3 matrix (mat2).
My solution uses a for-loop. What about vectorization?
mat1 = randi(100, 10, 5, 3)
mat2 = nan(size(mat1))
rows_to_copy = [4, 2, 5]
for i = 1 : 3
mat2(1, :, i) = mat1(rows_to_copy(i), :, i)
end
Any vectorized solution is likely not going to be as simple as your for loop solution, and might actually be less efficient (edit: see timing tests below). However, if you're curious, vectorizing an indexing operation like this generally involves converting your desired indices from subscripts to linear indices. Normally you can do this using sub2ind, but since you're selecting entire rows it may be more efficient to calculate the index yourself.
Here's a solution that takes advantage of implicit expansion in newer versions of MATLAB (R2016b and later):
[R, C, D] = size(mat1);
index = rows_to_copy+R.*(0:(C-1)).'+R*C.*(0:(D-1));
mat2(1, :, :) = reshape(mat1(index), 1, C, D);
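For comparison, the same linear indices could also be built with sub2ind, at the cost of materializing full subscript arrays first (a sketch using the R, C, D sizes defined above):
ROW = repmat(rows_to_copy, C, 1);   % row to copy, for every column/page combination
COL = repmat((1:C).', 1, D);        % column subscripts
PAGE = repmat(1:D, C, 1);           % page subscripts
index = sub2ind(size(mat1), ROW, COL, PAGE);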
Note that if you don't really need all the extra space full of NaN values in mat2, you can make your result more compact by just concatenating all the rows into a 2-D matrix instead:
>> mat2 = mat1(index).'
mat2 =
95 41 2 19 44
38 31 93 27 27
49 10 72 91 49
And if you're still using an older version of MATLAB without implicit expansion, you can use bsxfun instead:
index = bsxfun(@plus, rows_to_copy+R*C.*(0:(D-1)), R.*(0:(C-1)).');
Timing
I ran some tests using timeit (R2018a, Windows 7 64-bit) to see how the loop and indexing solutions compared. I tested 3 different scenarios: increasing row size, increasing column size, and increasing page size (third dimension) for mat1. The rows_to_copy was randomly selected and always had the same number of elements as the page size of mat1. Here are the results, showing the ratio of the loop time versus the indexing time:
Aside from some transient noise, there are some clear patterns. Increasing either the number of rows or columns (blue or red lines) doesn't appreciably change the time ratio, which hovers in the range of 0.7 to 0.9, meaning the for loop is slightly faster on average. Increasing the number of pages (yellow line) means the for loop has to iterate more times, and the indexing solution quickly starts to win out, reaching an 8 times speedup when the page size exceeds about 150.
I have a 164 x 246 matrix called M. M is data for time series containing 246 time points of 164 brain regions. I want to work on only specific blocks of the time series, not the whole thing. To do so, I created a vector called onsets containing the time onset of each block.
onsets = [7;37;82;112;145;175;190;220];
In this example, there are 8 blocks total (though this number can vary), each block containing 9 time points. So, for instance, the first block would contain time points 7, 8, 9, ..., 15; the second block would contain time points 37, 38, 39, ..., 45. I would like to extract the time points for these 8 blocks from M and concatenate these 8 blocks. Thus, the output should be a 164 x 72 matrix (i.e., 164 regions, 8 blocks × 9 time points per block).
This seems like a very simple indexing problem but I'm struggling to do it efficiently. I've tried indexing each block in M (for instance, vertcat(M(onsets(1,1):onsets(1,1)+8,:));) and then using vertcat, but this seems very clumsy. Can anyone help?
Try this:
% create sample data
M = rand(164,246);
% create index vector
idx = false(1,size(M,2));
onsets = [7;37;82;112;145;175;190;220];
for i=1:numel(onsets)
idx(onsets(i):onsets(i)+8) = true;
end
% create output matrix
MM = M(:,idx);
You seem to have switched the dimensions somehow, i.e. you try to operate on the rows of M whilst according to your description you need to operate on the columns. Hope this helps.
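As a side note, the loop can also be avoided entirely with implicit expansion (R2016b and later); a minimal sketch under the same assumptions (onsets is a column vector and every block spans 9 consecutive time points):
idx = onsets + (0:8);                % 8-by-9: each row holds one block's time points
MM = M(:, reshape(idx.', 1, []));    % 164-by-72, blocks concatenated in onset order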
Is there a way to force MATLAB to use single precision as default precision?
I have a MATLAB code, whose output I need to compare to C code output, and C code is written exclusively using floats, no doubles allowed.
Short answer: You can't.
Longer answer: In most cases, you can get around this by setting your initial variables to single. Once that's done, that type will (almost always) propagate down through your code. (cf. this and this thread on MathWorks).
So, for instance, if you do something like:
>> x = single(magic(4));
>> y = double(6);
>> x * y
ans =
4×4 single matrix
96 12 18 78
30 66 60 48
54 42 36 72
24 84 90 6
MATLAB keeps the answer in the lower precision. I have occasionally encountered functions, both built-in and from the File Exchange, that recast the output as a double, so you will want to sprinkle in the occasional assert statement during your initial debugging to keep things honest (or, better yet, make such an assertion the first line of any sub-function you write, to check the critical inputs), but this should get you 99% of the way there.
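For example, a type check like the following (a minimal sketch) will error out as soon as something silently promotes a variable back to double:
assert(isa(x, 'single'), 'Expected single precision, got %s', class(x));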
You can convert any object A to single precision using A=single(A);
The MathWorks forums suggest that, in your case, the undocumented call system_dependent('precision','8') should do it. Try it in the console or add it at the top of your script.
I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters, I only need to find similar rows. Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked-for number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
MATLAB's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
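If you also want explicit group labels out of the hierarchy rather than reading them off the plot, the Statistics Toolbox cluster function can cut the tree into three groups; a minimal sketch:
T = cluster(l, 'maxclust', 3);   % T(i) is the cluster label of data string i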
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the pairwise Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and its diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6
for j=1:6
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing is not worth the effort.)
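As a side note, if the Statistics Toolbox is available, pdist can build the same matrix in one line; a sketch (pdist's 'hamming' option returns the fraction of differing positions, so scale by the string length to get counts):
hamming_dist = squareform(pdist(double(x), 'hamming')) * size(x,2);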
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend a Kohonen network (self-organizing map) or a similar technique to present your data in, say, 2 dimensions. In general, this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 of the 1000 dimensions :P
scikit-learn is a good Python package to get you started; similar libraries exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
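For instance, a minimal PCA sketch in MATLAB (assuming the Statistics Toolbox pca function and that the dataset is in x) that projects the six strings onto their first two principal components:
[~, score] = pca(double(x));                 % rows of score are the 6 data strings in PC space
scatter(score(:,1), score(:,2), 50, 'filled');
text(score(:,1), score(:,2), cellstr(num2str((1:6).')));   % label each point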
Concerns
I have a concern over your data set though. 6 data points is really a small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit-long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, each attribute can take on only 2^6 = 64 distinct patterns across the dataset, which means at least 936 out of the 1000 attributes you have are either truly redundant or indistinguishable from redundant.
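A quick way to check that on your own data (a sketch, assuming the dataset is in x with attributes as columns):
n_distinct = size(unique(x.', 'rows'), 1);   % number of distinct attribute columns (at most 2^6 = 64)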
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even just from time to time, you cannot really trust your result.
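A minimal sketch of that stability check (assuming the dataset is in x; co-assignment matrices are compared because they are invariant to how cluster labels happen to be permuted between runs):
nruns = 20;
agree = zeros(1, nruns);
ref = [];
for r = 1:nruns
idx = kmeans(x, 3, 'distance', 'hamming');
co = bsxfun(@eq, idx, idx.');      % co-assignment: 1 where two points share a cluster
if isempty(ref), ref = co; end
agree(r) = mean(co(:) == ref(:));  % pairwise agreement with the first run
end
fprintf('Mean pairwise agreement with the first run: %.2f\n', mean(agree));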
I used a barcode-type visualization for my data. The code posted here earlier by Oleg was too heavy for my solution (image files were over 500 KB), so I used image() to make the figures:
function barcode(A)
% Render the binary matrix A as a barcode-style image
B = (A+1)*2;                      % map 0 -> 2 (white) and 1 -> 4 (black) in the 'flag' colormap
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out';
end
I have a matrix in MATLAB, let's say:
a = [
89 79 96
72 51 74
94 88 87
69 47 78
]
I want to subtract from each element the average of its column and divide by the column's standard deviation. How can I do this in a way that works for any other matrix, without using loops?
Thanks
If your version supports bsxfun (which is probably the case unless you have a very old MATLAB version), you should use it; it's much faster than repmat and consumes much less memory.
You can just do: result = bsxfun(@rdivide,bsxfun(@minus,a,mean(a)),std(a))
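On R2016b or newer, implicit expansion makes bsxfun unnecessary, so the equivalent one-liner would simply be:
result = (a - mean(a)) ./ std(a);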
You can use repmat to make your average/std vector the same size as your original matrix, then use direct computation like so:
[rows, cols] = size(a); %# to get the number of rows
avgc = repmat(mean(a),[rows 1]); %# column means, vertically replicated by number of rows
stdc = repmat(std(a),[rows 1]); %# column std, vertically replicated by number of rows
%# Here, a, avgc and stdc are the same size
result= (a - avgc) ./ stdc;
Edit:
Judging from a MathWorks blog post, the bsxfun solution is faster and consumes less memory (see acai's answer). For moderate-size matrices, I personally prefer repmat, which makes the code easier to read and debug (for me).
You could also use the ZSCORE function from the Statistics Toolbox:
result = zscore(a)
In fact, it calls BSXFUN underneath, but it is careful not to divide by a zero standard deviation (you can look at the source code yourself: edit zscore)