I am trying to create a connectivity matrix for a graph with N nodes. The connectivity rules state that it should have 1000 randomly assigned one-way connections (nodes cannot be connected to themselves).
What I want to do is generate an N-by-N matrix with mostly zeros and 1000 ones in random places, but not on the main diagonal.
I really don't have any ideas on how to achieve this. I thought about generating a matrix of random numbers between 0 and N/1000 and then setting those above (N-1)/1000 to one and the rest to 0, but this isn't very precise (I may get more or fewer than 1000 ones) and I don't know what to do about the diagonal.
What about this: find the indices of the non-diagonal elements, choose some of those at random, and then populate those indices with ones:
nn = 10; % Size of matrix
nr = 20; % number of random connections
ident = eye(nn);
nd_idx = find(~ident); % Indices of non-diag elements
con = randperm(numel(nd_idx), nr); % Pick random elements
m = zeros(nn);
m( nd_idx(con) ) = 1;
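With nn set to the question's N and nr = 1000, the same code places exactly the requested number of ones off the diagonal.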
If you want to get a matrix with exactly 1000 randomly located true values, my suggestion is to create a random matrix and use the positions of its lowest (or highest) 1000 elements. To keep the diagonal out of the running, use eye(). So, something like this:
N = 5000;
nNodes = 1000;
a = rand (N);
a(logical(eye(N))) = 2; % push the diagonal above the rand() range
threshold = sort (a(:))(nNodes);
b = false (N);
b(a <= threshold) = true; % keep the nNodes smallest entries
I think MATLAB hasn't implemented indexing directly into a function's output yet; that's still only available in Octave. If that's the case, you will need a temporary variable to hold the sorted array, which can take some memory for large matrices.
threshold = sort (a(:));
threshold = threshold(nNodes);
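Putting the pieces together in MATLAB-compatible form (a sketch; the exact count assumes no ties among the random values, which is essentially guaranteed with rand):
N = 5000;
nNodes = 1000;
a = rand(N);
a(logical(eye(N))) = 2;       % diagonal can never be among the smallest values
sorted_a = sort(a(:));        % temporary variable instead of chained indexing
threshold = sorted_a(nNodes); % value of the nNodes-th smallest element
b = (a <= threshold);         % logical matrix with exactly nNodes true entries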
Generate a random matrix A.
Round its items.
Generate a matrix B of ones with zeros on the main diagonal (you can create a matrix of ones, then subtract a matrix with ones on the main diagonal from it).
Multiply A elementwise by B (a sketch of these steps follows below).
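The recipe says "round", but rounding rand() directly gives a density of one half; comparing against the target density instead keeps the expected count near 1000. A minimal sketch with that adjustment (hedged: it yields approximately, not exactly, 1000 ones):
N = 5000;             % assumed size
p = 1000 / (N*(N-1)); % off-diagonal connection probability
A = rand(N) < p;      % random 0/1 matrix, about 1000 ones expected
B = ones(N) - eye(N); % ones everywhere except the main diagonal
C = A .* B;           % zero out anything that fell on the diagonal
nnz(C)                % approximately, but not exactly, 1000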
#!/usr/bin/python3
import sys
from random import randint

if len(sys.argv) != 3:
    sys.exit("usage: " + sys.argv[0] + " matrix-size num-of-connections")

matrixSize = int(sys.argv[1])
numOfConnections = int(sys.argv[2])

# note: the same (a, b) pair can be drawn more than once
i = 0
while i < numOfConnections:
    a = randint(1, matrixSize)
    b = randint(1, matrixSize)
    if a == b:  # reject self-connections and draw again
        continue
    i += 1
    print("connection from %d to %d" % (a, b))
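The same rejection-sampling idea in MATLAB, but building the actual matrix and also rejecting duplicate pairs, might look like this (a sketch):
N = 5000;
nr = 1000;
m = sparse(N, N); % sparse, in case N is large
placed = 0;
while placed < nr
    a = randi(N);
    b = randi(N);
    if a ~= b && m(a,b) == 0 % skip self-loops and already-placed connections
        m(a,b) = 1;
        placed = placed + 1;
    end
end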
I have a 102-by-102 matrix. I want to select square sub-matrices of orders from 2 up to 8 using random column numbers. Here is what I have done so far.
matt is the original matrix of size 102-by-102.
ittr = 30;
cols = 3;
for i = 1:ittr
rr = randi([2,102], cols,1);
mattsub = matt([rr(1) rr(2) rr(3)], [rr(1) rr(2) rr(3)]);
end
I have to extract matrices of different orders from 2 to 8. Using the above code, I would have to change the mattsub line every time I change cols. I believe it is possible to do this with another loop inside, but I cannot figure out how. How can I do this?
There is no need to extract elements of a vector and concatenate them; just use the vector itself to index the matrix.
Instead of:
mattsub = matt([rr(1) rr(2) rr(3)], [rr(1) rr(2) rr(3)]);
Use this:
mattsub = matt(rr, rr);
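Combined with a random order per iteration, the whole loop might look like this (a sketch; randperm keeps the chosen indices distinct):
matt = rand(102);        % stand-in for the original matrix
ittr = 30;
mattsub = cell(ittr, 1); % orders differ, so collect the results in a cell array
for i = 1:ittr
    order = randi([2 8]);      % random order between 2 and 8
    rr = randperm(102, order); % distinct random row/column indices
    mattsub{i} = matt(rr, rr);
end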
Defining a set of random sizes is pretty easy using the randi function. Once this is done, arrayfun can apply the extraction across your N iterations. Within each iteration, the randperm and sort functions are used to build the random indices into the original matrix M.
Here is the full code:
% Define the starting parameters...
M = rand(102);
N = 30;
% Retrieve the matrix rows and columns...
M_rows = size(M,1);
M_cols = size(M,2);
% Create a vector of random sizes between 2 and 8...
sizes = randi(7,N,1) + 1;
% Generate the random submatrices and insert them into a vector of cells...
subs = arrayfun(@(x) M(sort(randperm(M_rows,x)),sort(randperm(M_cols,x))), sizes, 'UniformOutput', false);
This can work on any type of matrix, even non-squared ones.
You don't need another loop; one is enough. If you use randi to get a random integer as the size of your submatrix, and then use that to get random column and row indices, you can easily extract a random submatrix. Do note that the output is a cell array, as the submatrices won't all be of the same size.
N=102; % Or substitute with some size function
matt = rand(N); % Initial matrix, use your own
itr = 30; % Number of iterations
mattsub = cell(itr,1); % Cell for non-uniform output
for ii = 1:itr
X = randi(7)+1; % Random integer between 2 and 8
colr = randi(N-X+1); % Random starting column (so colr+X-1 <= N)
rowr = randi(N-X+1); % Random starting row
mattsub{ii} = matt(rowr:(rowr+X-1),colr:(colr+X-1));
end
I've written a function that generates a sparse d-by-n matrix and puts 2 non-zero values in each column.
function [M] = generateSparse(n,d)
M = sparse(d,n);
sz = size(M);
nnzs = 2;
val = ceil(rand(nnzs,n)); % ceil of rand in (0,1) is always 1, so these are all ones
inds = zeros(nnzs,n); % row indices, one column per matrix column
for i=1:n
ind = randperm(d,nnzs);
inds(:,i) = ind;
end
points = (1:n);
nnzInds = zeros(nnzs,n);
for i=1:nnzs
nnzInd = sub2ind(sz, inds(i,:), points);
nnzInds(i,:) = nnzInd;
end
M(nnzInds) = val;
end
However, I'd like to give the function another parameter, num_nnz, which makes it choose num_nnz cells at random and put 1 in them.
I can't use sprand, as it requires a density, and I need the number of non-zero entries to be independent of the matrix size; a density is inherently tied to the matrix size.
I am a bit confused about how to pick the indices and fill them. I did it with a loop, which is extremely costly, and would appreciate help.
EDIT:
Everything has to be sparse. A big enough matrix will run out of memory if I don't do it in a sparse way.
You seem close!
You could pick num_nnz random (unique) linear indices between 1 and the number of elements in the matrix, then assign the value 1 at those indices.
To pick the random unique integers, use randperm. To get the number of elements in the matrix use numel.
M = sparse(d, n); % create dxn sparse matrix
num_nnz = 10; % number of non-zero elements
idx = randperm(numel(M), num_nnz); % get unique random indices
M(idx) = 1; % Assign 1 to those indices
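Folded into a function along the lines of the one in the question, this might look as follows (a sketch; the name generateSparseN is illustrative, and building with sparse(r,c,v,d,n) avoids indexed assignment into an existing sparse matrix):
function M = generateSparseN(n, d, num_nnz)
% d-by-n sparse matrix with num_nnz ones at unique random positions
idx = randperm(d*n, num_nnz); % unique random linear indices
[r, c] = ind2sub([d n], idx); % convert to row/column subscripts
M = sparse(r, c, 1, d, n);    % construct sparsely, no dense intermediate
end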
Assuming that I have a dataset of the following sizes:
train = 500,000 x 960 % number of training samples, each a vector of length 960
B_base = 1,000,000 x 960 % number of base samples, each a vector of length 960
Query = 1,000 x 960 % number of query samples, each a vector of length 960
truth_nn = 1,000 x 100
truth_nn contains the ground truth neighbors in the form of the
pre-computed k nearest neighbors and their squared Euclidean distances. So the columns of truth_nn represent the k = 100 nearest neighbors. I am finding it difficult to apply nearest neighbor search in the code snippet. Can somebody please show how to apply the ground truth neighbors truth_nn in finding the mean average precision-recall?
It would be of immense help if somebody could show this with a small example, by creating a data matrix, a query matrix, and ground truth neighbors in the form of pre-computed k nearest neighbors and their squared Euclidean distances. I tried creating a sample database.
Assume, the base data is
B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query data is
Query = [1 1; 2 1; 6 2];
[neighbors, distances] = knnsearch(B_base, Query, 'K', 2);
would find the 2 nearest neighbors.
Question 1: how do I create the truth data containing the ground truth neighbors and pre-computed k nearest neighbor distances?
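One way to build such ground truth for the toy data above (a sketch; it assumes an exact knnsearch result is acceptable as ground truth):
k = 2;
[truth_ids, truth_dist] = knnsearch(B_base, Query, 'K', k); % exact k nearest neighbors
truth_nn = [truth_ids, truth_dist.^2]; % ids in columns 1..k, squared distances in columns k+1..2k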
This is called the mean average precision-recall. I tried implementing the k-nearest neighbor search and the average precision-recall as follows, but I cannot understand how to apply the ground truth table.
Question 2:
I am trying to apply k nearest neighbor search by first converting the real-valued features into binary.
I am unable to apply the concept of k-nearest neighbor search for different values of k = 10, 20, 50, and to check how much data has been correctly recalled using the GIST database. In the GIST truth_nn() file, when I specify truth_nn(i,1:k) for a query vector i, the function AveragePrecision throws an error. So, if somebody can show, using any sample ground truth with a structure similar to that in GIST, how to properly specify k and calculate the average precision-recall, then I shall be able to apply the solution to the GIST database. As of now, this is my approach, and it would be of immense help if the correct way were shown with an example that is easier for me to relate to the GIST database. The problem is: how can I find neighbors from the ground truth and compare them with the neighbors obtained after sorting the distances?
I am also interested in how I can apply pdist2() instead of the present distance calculation, as it takes a long time.
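For reference, a pdist2-based replacement for the manual distance loop below might look like this (a sketch; note that pdist2 returns Euclidean, not squared, distances):
k = 50; % the question targets k = 50
D = pdist2(Query, B_base); % numQueries-by-numBase distances in one call
[sortval, sortpos] = sort(D, 2, 'ascend'); % sort each row
neighborIds = sortpos(:, 1:k);
neighborDistances = sortval(:, 1:k);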
numQueryVectors = size(Query,1);
%Calculate distances
for i=1:numQueryVectors
dist = sum((repmat(queryMatrix(i,:),numDataVectors,1)-B_base).^2,2); % queryMatrix should hold the same data as Query
[sortval sortpos] = sort(dist,'ascend');
neighborIds(i,:) = sortpos(1:k);
neighborDistances(i,:) = sqrt(sortval(1:k));
end
%Sorting calculated nearest neighbor distances for k = 50
%HOW DO I SPECIFY k = 50 in the ground truth, truth_nn
for i=1:numQueryVectors
AP(i) = AveragePrecision(neighborIds(i,:),truth_nn(i,:));
end
mAP = mean(AP);
function ap = AveragePrecision(rank_id, truth_id)
truth_num = length(truth_id);
truth_pos = zeros(truth_num,1);
for j=1:50 %% for k = 50 nearest neighbors; note: this errors if truth_id has fewer than 50 entries or if a truth id is absent from rank_id
truth_pos(j) = find(rank_id == truth_id(j));
end
truth_pos = sort(truth_pos, 'ascend');
% compute average precision as the area below the recall-precision curve
ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
p = j/truth_pos(j);
ap = ap + p*delta_recall;
end
end
UPDATE: Based on the solution, I tried to calculate the mean average precision using the formula given here and a reference code. But I am not sure if my approach is correct, because the theory says that I need to rank the returned queries based on the indices, and I do not fully understand this. Mean average precision is required to judge the quality of the retrieval algorithm.
precision = positives/total_data;
recall = positives /(positives+negatives);
truth_pos = sort(positives, 'ascend');
truth_num = length(truth_pos);
ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
p = j/truth_pos(j);
ap = ap + p*delta_recall;
end
ap
The value of ap is Inf, the value of positives is 0, and negatives is 150. This means that knnsearch() does not work at all.
I think you are doing extra work. This process is very simple in MATLAB; you can also operate on entire arrays, which should be faster than for loops and is a bit easier to read.
Your truth_nn and neighbors should hold the same data if there are no errors, with one entry per row. MATLAB already returns the knnsearch result sorted in ascending order, so column 1 is the closest neighbor, column 2 the second closest, column 3 the third closest, and so on. There is no need to sort the data again.
Just compare truth_nn to neighbors to get your statistics. This is a simple example to show you how the program should go; it will not work on your data without some modification.
%in your example this is provided, I created my own
truth_nn = [1,2;
1,3;
4,3];
B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query = [1 1; 2 1; 6 2];
%performs k-nearest-neighbor search (not k-means)
num_neighbors = 2;
[neighbors, distances] = knnsearch(B_base, Query, 'K', num_neighbors);
%--- output---
% neighbors = [1,2;
% 1,2; notice this doesn't match truth_nn 1,3
% 4,3]
% distances = [ 0 1.4142;
% 1.0000 1.0000;
% 2.8284 3.0000];
%computes statistics; nnz counts the number of nonzero elements, in the first
%case every piece of data that matches
%NOTE1: the indexing truth_nn(:,1:num_neighbors) says use all rows
% but only the first num_neighbors columns. This should
% prevent the dimension mismatch error you were getting
positives = nnz(neighbors == truth_nn(:,1:num_neighbors)); %result = 5
negatives = nnz(neighbors ~= truth_nn(:,1:num_neighbors)); %result = 1
%NOTE2: I've switched this from truth_nn to neighbors; this helps
% when you change num_neighbors
total_data = numel(neighbors); %result = 6
percent_incorrect = 100*(negatives / total_data); % 16.6666
percent_correct = 100*(positives / total_data); % 83.3333
The matrix H is n-by-n, with n = 10000. I can use a loop to generate this matrix in MATLAB; I just wonder if there are any methods that can do this without looping.
You can see that the upper right portion of the matrix consists of 1 / sqrt(n*(n-1)), the diagonal elements consist of -(n-1)/sqrt(n*(n-1)), the first column consists of 1/sqrt(n) and the rest of the elements are zero.
We can generate the full matrix that consists of the first column having all 1 / sqrt(n), then having the rest of the columns with 1 / sqrt(n*(n-1)) then we'll need to modify the matrix to include the rest of what you want.
As such, let's concentrate on the elements that start from row 2, column 2 as these follow a pattern. Once we're done, we can construct the other things that build up the final matrix.
x = 2:n;
Hsmall = repmat([1./sqrt(x.*(x-1))], n-1, 1);
Next, we will tackle the diagonal elements:
Hsmall(logical(eye(n-1))) = -(x-1)./sqrt(x.*(x-1));
Now, let's zero the rest of the elements:
Hsmall(tril(logical(ones(n-1)),-1)) = 0;
Now that we're done, let's create a new matrix that pieces all of this together:
H = [1/sqrt(n) 1./sqrt(x.*(x-1)); repmat(1/sqrt(n), n-1, 1) Hsmall];
Therefore, the full code is:
x = 2:n;
Hsmall = repmat([1./sqrt(x.*(x-1))], n-1, 1);
Hsmall(logical(eye(n-1))) = -(x-1)./sqrt(x.*(x-1));
Hsmall(tril(logical(ones(n-1)),-1)) = 0;
H = [1/sqrt(n) 1./sqrt(x.*(x-1)); repmat(1/sqrt(n), n-1, 1) Hsmall];
Here's an example with n = 6:
>> H
H =
Columns 1 through 3
0.408248290463863 0.707106781186547 0.408248290463863
0.408248290463863 -0.707106781186547 0.408248290463863
0.408248290463863 0 -0.816496580927726
0.408248290463863 0 0
0.408248290463863 0 0
0.408248290463863 0 0
Columns 4 through 6
0.288675134594813 0.223606797749979 0.182574185835055
0.288675134594813 0.223606797749979 0.182574185835055
0.288675134594813 0.223606797749979 0.182574185835055
-0.866025403784439 0.223606797749979 0.182574185835055
0 -0.894427190999916 0.182574185835055
0 0 -0.912870929175277
Since you are working with a pretty large n value of 10000, you might want to squeeze out as much performance as possible.
Going with that, you can use an efficient approach based on cumsum -
%// Values to be set in each column for the upper triangular region
%// (the first element comes out as Inf here, but it is overwritten below)
upper_tri = 1./sqrt([1:n].*(0:n-1));
%// Diagonal indices
diag_idx = [1:n+1:n*n];
%// Setup output array
out = zeros(n,n);
%// Set the first row of output array with upper triangular values
out(1,:) = upper_tri;
%// Set the diagonal elements with the negative triangular values.
%// The intention here is to perform CUMSUM across each column later on,
%// thus there would be zeros beyond the diagonal positions for each column
out(diag_idx) = -upper_tri;
%// Set the first element of output array with n^(-1/2)
out(1) = 1/sqrt(n);
%// Finally, perform CUMSUM as suggested earlier
out = cumsum(out,1);
%// Set the diagonal elements with the actually expected values
out(diag_idx(2:end)) = upper_tri(2:end).*[-1:-1:-(n-1)];
Runtime Tests
(I) With n = 10000, the runtime at my end was: Elapsed time is 0.457543 seconds.
(II) Now, as a final performance-squeezing practice, you can replace the pre-allocation step for out with a faster pre-allocation scheme, as described on the Undocumented MATLAB blog. Thus, the pre-allocation step would look like this -
out(n,n) = 0;
The runtime with this edited code was - Elapsed time is 0.400399 seconds.
(III) The runtime for n = 10000 with the other answer by @rayryeng yielded - Elapsed time is 1.306339 seconds.
I have two lists of 2-dimensional points given as M x 2 and N x 2 matrices, respectively, with M and N possibly being very large.
What is the fastest way to determine how many of them are equal?
I am not sure whether you want to count repeated entries, but if not, you could use intersect or a fairly intuitive algorithm based on sorting (see below). I would not recommend a nested-loop version.
function test_compareVecs()
%% create some random data
N = 31415;
M1 = 100000;
M2 = 200000;
vec = rand(N,2);
v1 = [rand(M1-N,2); vec];
v2 = [rand(M2-N,2); vec];
v1 = v1(randperm(M1),:);
v2 = v2(randperm(M2),:);
%% intersect
disp('intersect:');
tic
s = size(intersect(v1,v2,'rows'),1);
toc;
s
%% alternative approach
disp('alternative approach:');
tic;
s = compareVecs(v1,v2);
toc;
s
end
function s = compareVecs(v1,v2)
%% create help vector
help_vec = [[v1,zeros(size(v1,1),1)]; ...
[v2,ones(size(v2,1),1)]];
%% sort by first column
% note: for some reason "sortrows(help_vec,1)" is slower
hash_vec = help_vec(:,1); % dummy hash: sort on x only (rare x-ties could separate a matching pair)
[~,sidx] = sort(hash_vec);
help_vec = help_vec(sidx,:);
%% diff + compare
help_vec = diff(help_vec);
s = sum(help_vec(:,1) == 0 & ...
help_vec(:,2) == 0 & ...
help_vec(:,3) ~= 0);
end
Result
intersect:
Elapsed time is 0.145717 seconds.
s = 31415
alternative approach:
Elapsed time is 0.048084 seconds.
s = 31415
Compute all pair-wise distances with pdist2 and then count pairs with zero distance. If the coordinates are float values, you may want to use a tolerance instead of comparing against zero:
%// Data:
M = 10;
N = 8;
listM = randi(10,M,2)-1;
listN = randi(10,N,2)-1;
tol = 1e-6;
%// Distance matrix:
d = pdist2(listM, listN);
%// Count:
count = sum(d(:)<tol);
This should work irrespective of the order of the points in each list, or of their lengths. It is a hash-table/dictionary solution that should be fast, with memory demand linear in the lengths of the lists. Please note that the syntax below may not be perfect, but a quick reference to the main data structures mentioned should make corrections trivial.
(1) Populate a dictionary-like containers.Map, in such a way that the key is a unique function of the point, e.g. [num2str(M(i,1)) '-' num2str(M(i,2))].
(2) Then go over all elements of the second list, create the key just as in (1), and check if it exists. If it does, set map(key) = 1; otherwise set it to 0. In the end, all keys corresponding to common points will hold 1s, and the rest will hold zeros.
(3) Finalize by summing over the values of the map (something like sum(cell2mat(map.values))), which gives the total number of unique intersections between the two sets, irrespective of the order in which the points appear in each list.
Note: if you want to count all repeated points rather than just unique intersections, then in (2), rather than setting map(key) = 1, add 1 to map(key). The rest is the same. A sketch of steps (1)-(3) follows below.
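A sketch of steps (1)-(3) using containers.Map (hedged: it assumes "equal" means bitwise-identical coordinates, since the keys are exact string encodings; the lists M and P here are hypothetical):
M = [1 1; 2 2; 3 4];      % hypothetical first list
P = [2 2; 5 6; 3 4; 2 2]; % hypothetical second list
map = containers.Map('KeyType','char','ValueType','double');
for i = 1:size(M,1) % (1) register every point of the first list
    key = [num2str(M(i,1)) '-' num2str(M(i,2))];
    map(key) = 0;
end
for i = 1:size(P,1) % (2) mark keys also seen in the second list
    key = [num2str(P(i,1)) '-' num2str(P(i,2))];
    if isKey(map, key)
        map(key) = 1; % use map(key) + 1 instead to count repeats
    end
end
total = sum(cell2mat(map.values)); % (3) number of common unique points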