I have an Nx2 matrix that contains the edges of a graph. The entries of the matrix are Twitter user IDs, and each row is a "retweeted" relation (one user retweeted another). In total my graph has N retweet relations among M users. I want to replace the original Twitter IDs with the IDs 1:M. For example, the first ID in the graph should be replaced with 1 in every row and column where it appears. I want to do this without changing an ID again once it has already been replaced. I tried a for-loop combined with the find function to transform the IDs into indices, but how do I avoid changing entries that have already been changed? I know that my code is wrong:
counter = 0;
for index = 1:length(grph)
index1 = find(grph(:,1) == grph(index,1));
index2 = find(grph(:,2) == grph(index,2));
counter = counter+1;
grph(index1,1) = counter;
counter = counter+1;
grph(index2,2) = counter;
end
A small example that illustrates what I want:
35113 45010
5695 57711
22880 33193
22880 45010
43914 35113
Desired output :
1 2
3 4
5 6
5 2
7 1
Pretty simple. Use a combination of unique and reshape. Assuming your ID matrix in your example was stored in A:
[~,~,B] = unique(A.', 'stable');
C = reshape(B, [size(A,2) size(A,1)]).';
A is the matrix of IDs and C is your desired output. The third output of unique gives, for every element of its input, the position of that element's value in the list of unique values, which is exactly the new ID we want. We transpose A first because MATLAB traverses matrices column by column, whereas you want the IDs assigned row by row; transposing makes the column-major traversal follow the rows of A. You also need the 'stable' flag so that IDs are assigned in the order in which the values are first encountered. Without 'stable', unique sorts the values in A first and then assigns the IDs.
B comes back as a column vector, so we need to reshape it into a matrix that is the same size as your input A. Because reshape also fills column by column and B was built by walking along the rows of A, I reshape to the size of the transpose (size(A,2) rows by size(A,1) columns) and then transpose that result to get your desired output.
Example use:
A = [35113 45010
5695 57711
22880 33193
22880 45010
43914 35113]; %// Matrix defined by you
[~,~,B] = unique(A.', 'stable');
C = reshape(B, [size(A,2) size(A,1)]).';
C =
1 2
3 4
5 6
5 2
7 1
However, if the new IDs don't have to follow the order of first appearance and you just want one consistent ID per node, you can call unique without the 'stable' flag; the IDs then follow the sorted order of the original values.
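To make the difference concrete, here is a small sketch of the same example without the 'stable' flag. The variable names Bsorted and Csorted are only illustrative; the expected result in the comment follows from sorting the seven distinct IDs by hand.

```matlab
A = [35113 45010
     5695  57711
     22880 33193
     22880 45010
     43914 35113];

% Default behaviour: unique sorts the values before numbering them
[~, ~, Bsorted] = unique(A.');
Csorted = reshape(Bsorted, [size(A,2) size(A,1)]).';

% Sorted IDs: 5695->1, 22880->2, 33193->3, 35113->4,
%             43914->5, 45010->6, 57711->7
% so Csorted is [4 6; 1 7; 2 3; 2 6; 5 4]
disp(Csorted);
```

Each node still gets exactly one ID, so the graph structure is unchanged; only the numbering differs from the 'stable' version.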
Now, if you want to know which IDs from your graph got assigned to which IDs in the output matrix, just use the first output of unique:
[mapping, ~, B] = unique(A.', 'stable');
mapping will give you a list of all unique IDs that were encountered in your matrix. Their position identifies what ID was used to assign them into B. In other words, running this, we get:
mapping =
35113
45010
5695
57711
22880
33193
43914
This means that ID 35113 in A gets mapped to 1 in B, ID 45010 in A gets mapped to 2 in B and so on. As a more verbose illustration:
mappings = [(1:numel(mapping)).' mapping]
mappings =
1 35113
2 45010
3 5695
4 57711
5 22880
6 33193
7 43914
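As a quick sanity check (a sketch, with recovered as an illustrative name), indexing mapping with the relabelled matrix C should reproduce the original IDs, because mapping(k) is the original ID that received the new ID k:

```matlab
A = [35113 45010
     5695  57711
     22880 33193
     22880 45010
     43914 35113];

[mapping, ~, B] = unique(A.', 'stable');
C = reshape(B, [size(A,2) size(A,1)]).';

% Look up the original ID for every new ID in C
recovered = mapping(C);

% The round trip should give back the original matrix
disp(isequal(recovered, A));
```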
I can't test right now, but this should do what you want:
[~, ~, kk] = unique(A.','stable');
result = reshape(kk, fliplr(size(A))).';
You need a recent enough Matlab version, so that unique has the 'stable' option.
If you have the Communications Toolbox, the second line could be replaced by
result = vec2mat(kk, size(A,2));
Related
I want to define a recursive function that sorts an input vector and uses a sequence of secondary vectors to break any ties (or randomises them if it runs out of tiebreak vectors)
Given some input vector I and some tiebreaker matrix T, the pseudocode for the algorithm is as follows:
check if T is empty, if so, we reached stopping condition, therefore randomise input
get order of indices for sorted I, using matlab's standard sort function
find indices of duplicate values
for each duplicate value,
call function recursively on T(:,1) with rows corresponding to the indices of that duplicate value, with T(:,2:end)(with appropriate rows) as the new tiebreaker matrix - if empty then this call will just return random indices
fix the order of the sorted indices in the original sorted I
return the sorted I and corresponding indices
Here is what I have so far:
function [vals,idxs] = tiebreak_sort(input, ties)
% if the tiebreak matrix is empty, then return random
if isempty(ties)
idxs = randperm(size(input,1));
vals = input(idxs);
return
end
% sort the input
[vals,idxs] = sort(input);
% check for duplicates
[~,unique_idx] = unique(vals);
dup_idx = setdiff(1:size(vals,1),unique_idx);
% iterate over each duplicate index
for i = 1:numel(dup_idx)
% resolve tiebreak for duplicates
[~,d_order] = tiebreak_sort(ties(input==input(i),1),...
ties(input==input(i),2:end));
% fix the order of sorted indices (THIS IS WHERE I AM STUCK)
idxs(vals==input(i)) = ...?
end
return
I want to find a way to map the output of the recursive call, to the indices in idxs, to fix their order based on the (possibly recursive) tie breaks, but my brain is getting twisted in knots thinking about it..
Can I just use the fact that MATLAB's sort function is stable and preserves the original order, and do it like this?
% find indices of duplicate values
dups = find(input==input(i));
% fix the order of sorted indices
idxs(vals==input(i)) = dups(d_order);
Or will that not work? Is there another way of doing what I am trying to do, in general?
Just to give a concrete example, this would be a sample input:
I = [1 2 2 1 2 2]'
T = [4 1 ;
3 7 ;
3 4 ;
2 2 ;
1 8 ;
5 3 ]
and the output would be:
vals = [1 1 2 2 2 2]'
idxs = [4 1 5 3 2 6]'
Here, there are clearly duplicates in the input, so the function is called recursively on the first column of the tiebreaker matrix, which was able to fix the 1s but it needed a second recursive call on the 3s of the first column to break those ties.
No need to define a function, sortrows does that:
[S idxs] = sortrows([I T]);
vals = S(:,1);
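Here is a sketch of that call on the sample input from the question. Note that sortrows leaves rows that tie on every column in their original order rather than randomising them. If random tie-breaking is really needed, one option (my assumption, not part of the answer above) is to append a random column as the last tiebreaker; in this example no two rows tie on every column, so the result is the same.

```matlab
I = [1 2 2 1 2 2]';
T = [4 1
     3 7
     3 4
     2 2
     1 8
     5 3];

% Sort by I first, then by each column of T in turn
[S, idxs] = sortrows([I T]);
vals = S(:,1);

% Hypothetical extension: a random last column breaks any remaining ties
[~, idxsRand] = sortrows([I T rand(numel(I),1)]);

disp([vals idxs]);
```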
I have this code:
A = [3,1,5,8]
B = [0, 0]
indexB = [1,2,2,1]
for i = 1:4
B(indexB(i)) = B(indexB(i)) + A(i)
end
So, in the end, I got
B = [11, 6]
I wonder if I can use a more efficient way to sum up instead of using the for-loop?
Classic use of accumarray. Only this time, you accumulate the entries in A then add this on top of B as B is the starting point of the summation:
B = B(:); % Force into columns
B = B + accumarray(indexB(:), A(:));
How accumarray works is quite simple. You can think of it as a miniature MapReduce paradigm. Simply put, for each data point we have, there is a key and an associated value. The goal of accumarray is to place (or bin) all of the values that belong to the same key and do some operation on all of these values. In our case, the "key" would be the values in indexB where each element is a location to index into B. The values themselves are those from A. We would then want to add up all of the values that belong to each location in indexB together. Thankfully, the default behaviour for accumarray is to add all of these values. Specifically, the output of accumarray would be an array where each position computes the sum of all values that mapped to a key. For example, the first position would be the summation of all values that mapped to the key of 1, the second position would be the summation of all values that mapped to the key of 2 and so on.
Because you are using B as a starting point, the end result would be to take the summation result from accumarray and add this on top of B thus completing the code.
Minor Note
I do have to point out that accumarray works by columns. Because you are using rows, I had to force the input so that they are columns, which is the purpose of the (:) syntax. The output will also be as a column so you can transpose that if you wish to have it in a row format.
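One caveat worth sketching: accumarray sizes its output by the largest index it sees, so if the last positions of B never appear in indexB the addition fails on a size mismatch. Passing the output size explicitly avoids that. The variable name sums is illustrative.

```matlab
A = [3, 1, 5, 8];
B = [0, 0];
indexB = [1, 2, 2, 1];

% Third argument fixes the output size to match B,
% even if some positions of B receive no values
sums = accumarray(indexB(:), A(:), [numel(B) 1]);

% Add the per-position sums on top of the starting values
B = B(:) + sums;

disp(B.');   % row format, as in the question
```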
I have two vectors of length 16. The first one, r, for example is:
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
r contains a list of IDs. I want to collect the indices of the duplicate IDs in r so that each group is a list of indices for one ID. I would then use these indices to access a second vector a and find the maximum value incident on the indices for each group.
Therefore, I would like to produce an output vector using r and a such that:
max(a(1),a(5)), max(a(2),a(6)), a(3), a(7), max(a(4),a(8)), max(a(9),a(13)), max(a(10),a(14)), max(a(11),a(15)), max(a(12),a(16))
I also want to keep the indices of the maximum values. How can I efficiently implement this in MATLAB?
You can use the third output of unique to assign each unique number in r an integer ID. accumarray then bins the data by that ID: the key is the ID, and the value is the element of a at the corresponding position. Passing max as the function handle applies max to each bin, which gives the maximum element of a for every unique value in r:
%// Define r and a
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [...];
%// Relevant code
[~,~,id] = unique(r, 'stable');
out = accumarray(id(:), a(:), [], @max);
The 'stable' flag in unique is important because we want to assign unique IDs in order of occurrence. Not doing this will sort the values in r first before assigning IDs and that's not what we want.
Here's a quick example. Let me set up your problem by generating a random 16-element array a, which is what you ultimately want to index. We'll also set up r:
rng(123);
a = rand(16,1);
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
This is what a looks like:
>> a
a =
0.6965
0.2861
0.2269
0.5513
0.7195
0.4231
0.9808
0.6848
0.4809
0.3921
0.3432
0.7290
0.4386
0.0597
0.3980
0.7380
After running through the code, we get this:
out =
0.7195
0.4231
0.2269
0.6848
0.9808
0.4809
0.3921
0.3980
0.7380
You can verify for yourself that this gives the right result. Specifically, the first element is the maximum of a(1) and a(5), which are 0.6965 and 0.7195 respectively, so the maximum is 0.7195. Similarly, the second element is the maximum of a(2) and a(6), which are 0.2861 and 0.4231, so the maximum is 0.4231, and so on.
If you also want to remember which indices were used to select the maximum elements, this is slightly more complicated. You need to call accumarray once again, but this time the values are not those of a but the index values themselves. You'd use the second output of max to get the location of the value chosen. However, we can't grab the second output of max without explicitly calling its two-output form (I really wish there was another way around this... Python has a function in NumPy called numpy.argmax), and this can't be properly encapsulated in an anonymous function (i.e. @(x) ...), so you're going to need to create a custom function to do that.
Create a new function called maxmod and save it to a file called maxmod.m. You'd put this inside the function:
function p = maxmod(vals, ind)
[~,ii] = max(vals(ind));
p = ind(ii);
This takes in an array, called vals, and a list of indices into that array, called ind. It finds the maximum of the selected elements and returns the index that produced that maximum.
After, you'd call accumarray like so:
%// Define r and a
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [...];
%// Relevant code
[~,~,id] = unique(r, 'stable');
out = accumarray(id(:), (1:numel(r)).', [], @(x) maxmod(a,x));
This is now what I get:
>> out
out =
5
6
3
8
7
9
10
15
16
If you look at each value, this reflects which location of a we chose that corresponds to the maximum of each group.
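If you would rather not create a separate function file, a possible alternative (a sketch of a different technique, not the accumarray approach above) is to sort the rows by group ID and by descending value, then keep the first row of each group. The values of a are copied from the listing above, and the names order, isFirst, idx and vals are illustrative.

```matlab
r = [1;3;5;7;1;3;6;7;9;11;13;16;9;11;13;16];
a = [0.6965; 0.2861; 0.2269; 0.5513; 0.7195; 0.4231; 0.9808; 0.6848; ...
     0.4809; 0.3921; 0.3432; 0.7290; 0.4386; 0.0597; 0.3980; 0.7380];

[~, ~, id] = unique(r, 'stable');
id = id(:);

% Sort by group ID ascending, then by value descending
[~, order] = sortrows([id a], [1 -2]);

% The first row of each group now holds that group's maximum
isFirst = [true; diff(id(order)) ~= 0];
idx  = order(isFirst);   % positions in a of each group's maximum
vals = a(idx);           % the maxima themselves

disp([idx vals]);
```

This returns both the maxima and their positions in one pass, at the cost of a sort.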
I apologize for the formatting and what seems like a very easy question. I am new to matlab and this stack exchange. I am attempting to create an adjacency matrix from a few column vectors in matlab. The information was imported from a text file. The information looks like this.
X Y Z W
aa bb 1 aa
bb cc 2 bb
cc dd 3 cc
Where columns X and Y are the names of the vertex columns. Z is the weight. Columns X and Y have about 30000 entries, with repetition. Column W is all of the vertices in my graph sorted alphabetically without repetition.
The output should look like this for the sample data.
aa bb cc dd
aa 0 1 0 0
bb 1 0 2 0
cc 0 2 0 3
dd 0 0 3 0
I know how to create the matrix if the vertices are numerical. But I can't figure out how to assign numeric values to the vertices in column W and make everything still match up.
This code will work if the values in all the columns are numerical.
A = sparse([X; Y],[Y; X],[Z; Z]);
where X, Y and Z are the columns above. When I try this with the text columns, I get the following error:
Undefined function 'sparse' for input arguments of type 'cell'
You can still use sparse but you're going to have to do a bit more work. For one thing, we need to transform the labels in X and Y into unique integer IDs. Try using unique on the combined X and Y inputs so that you can get unique integer IDs shared between both.
Specifically, unique will give you a list of all unique entries of the input (so X and Y combined). The reason why we combine both X and Y is because there are certain tokens in X that may not be present in Y and vice-versa. Doing this ID assigning on the combined input will ensure consistency. The 'stable' flag is there because unique actually sorts all of the unique entries by default. If the input is a cell array of strings, the cell array is sorted in lexicographical order. If you want to maintain the order in which unique entries are encountered starting from the beginning to the end of the cell array, you use the 'stable' flag.
Next, what I would use is an associative array via a containers.Map that maps a string to a unique integer. Think of an associative array as a dictionary where the input is a key and the output is a value that is associated with this key. The best example of an associative array in this context would be the English dictionary. The key in this case is the word you want to look up, and the value is the definition of this word. The key is a character string, and the output is another character string.
Here, what we'll do is make the input a string and the output a single number. For each unique string we encountered with the combination of X and Y, we'll assign a unique ID to it. After that, we can use X and Y as inputs into the containers.Map to get our IDs which can then be used as input into sparse.
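As a minimal illustration of how a containers.Map behaves, here is a sketch with made-up keys (the labels and the names demoMap, oneId and ids are only for this example):

```matlab
% Keys are strings, values are the integer IDs we assign to them
labels  = {'aa', 'bb', 'cc'};
demoMap = containers.Map(labels, 1:numel(labels));

% Look up a single key
oneId = demoMap('bb');

% Look up several keys at once; values returns a cell array,
% so cell2mat converts it into a numeric vector
ids = cell2mat(values(demoMap, {'bb', 'aa', 'cc'}));

disp(oneId);
disp(ids);
```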
Without further ado, here's the code:
%// Your example
X = {'aa', 'bb', 'cc'};
Y = {'bb', 'cc', 'dd'};
Z = [1 2 3];
%// Call unique and get the unique entries
chars = unique([X Y], 'stable');
%// Create containers.Map
map = containers.Map(chars, 1:numel(chars));
%// Find the IDs for each of X and Y
idX = cell2mat(values(map, X)).';
idY = cell2mat(values(map, Y)).';
%// Create sparse matrix
A = sparse([idX; idY], [idY; idX], [Z(:); Z(:)]); %// Z(:) forces a column so the weights line up with the IDs
The third and second last lines of code are a bit peculiar. You need to use the values function to retrieve the values given a cell array of keys. We have X and Y as both cell arrays, and so the output is also a cell array of values. We don't want this to be a cell array but to be a numerical vector instead as input into sparse, so that's why we use cell2mat to convert this back for us. Once we finally retrieve the IDs for X and Y, we put this into sparse to complete the matrix.
When we display the full version of A, we get:
>> full(A)
ans =
0 1 0 0
1 0 2 0
0 2 0 3
0 0 3 0
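If you prefer to avoid containers.Map, a possible alternative (a sketch, not the method described above) is the second output of ismember, which returns the position of each label in a reference list. The names vertexNames and n are illustrative.

```matlab
X = {'aa', 'bb', 'cc'};
Y = {'bb', 'cc', 'dd'};
Z = [1 2 3];

% Reference list of vertex names, in order of first appearance
vertexNames = unique([X Y], 'stable');
n = numel(vertexNames);

% Position of each label in vertexNames serves as its integer ID
[~, idX] = ismember(X, vertexNames);
[~, idY] = ismember(Y, vertexNames);

% Symmetric adjacency matrix; (:) forces column vectors throughout
A = sparse([idX(:); idY(:)], [idY(:); idX(:)], [Z(:); Z(:)], n, n);

disp(full(A));
```

Passing n explicitly keeps the matrix square even if the last vertex in the list has no edges.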
Minor Note
I see that W is the cell array of the vertex names sorted and in alphabetical order. If that's the case, then you don't need to do any unique calling, and you can just use W as the input into the containers.Map. As such, do this:
%// Create containers.Map
map = containers.Map(W, 1:numel(W));
%// Find the IDs for each of X and Y
idX = cell2mat(values(map, X)).';
idY = cell2mat(values(map, Y)).';
%// Create sparse matrix
A = sparse([idX; idY], [idY; idX], [Z(:); Z(:)]);
I need help with taking the following data which is organized in a large matrix and averaging all of the values that have a matching ID (index) and outputting another matrix with just the ID and the averaged value that trail it.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be misled by the example I posted. The first number is repeated for about 15 lines, then the ID changes, and so on through a whole set of different IDs. After that, the whole group of IDs repeats again. Think of the first block as [1 2 3; 1 5 9; 2 5 7; 2 4 6]; the block then repeats with different values in every column except the ID. What I need is to average the values that trail each ID in MATLAB and output a clean matrix with a single row per ID, averaged over all occurrences of that ID.
Thanks for any help given.
A modification of this answer does the job, as follows:
[value_sort ind_sort] = sort(StarData(:,1));
[~, ii, jj] = unique(value_sort);
n = diff([0; ii]);
averages = NaN(length(n),size(StarData,2)); % preallocate
averages(:,1) = value_sort(ii); % ii indexes the sorted IDs, not the rows of StarData
for col = 2:size(StarData,2)
averages(:,col) = accumarray(jj,StarData(ind_sort,col))./n;
end
The result is in variable averages. Its first column contains the values used as indices, and each subsequent column contains the average for that column according to the index value.
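Compatibility issues for MATLAB R2013a onwards:
The function unique changed in MATLAB R2013a: its second output now returns the first occurrence of each value instead of the last, which breaks the count n = diff([0; ii]) above. From that version onwards, add the 'legacy' flag to unique, i.e. replace the second line by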
[~, ii, jj] = unique(value_sort,'legacy')
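A variant that does not depend on which occurrence unique returns is sketched below. It is a rewrite under my own assumptions rather than the answer's code: the counts come from accumarray instead of diff, and the data is the small block from the question, not real StarData.

```matlab
% Toy data: first column is the ID, the rest are values to average
StarData = [1 2 3
            1 5 9
            2 5 7
            2 4 6];

% ids: unique IDs (sorted); jj: group number of every row
[ids, ~, jj] = unique(StarData(:,1));
jj = jj(:);

% Number of rows in each group
counts = accumarray(jj, 1);

averages = NaN(numel(ids), size(StarData,2));   % preallocate
averages(:,1) = ids;
for col = 2:size(StarData,2)
    % Sum of each group divided by its row count
    averages(:,col) = accumarray(jj, StarData(:,col)) ./ counts;
end

disp(averages);
```

Because the third output of unique already maps every row to its group, no explicit sort is needed here.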