matlab: duplicate rows removal [duplicate] - matlab

This question already has answers here:
matlab: remove duplicate values
(3 answers)
Closed 9 years ago.
I had a question and got an answer yesterday about removing doubling rows in a matrix, and I can't figure out why it omits certain rows in a matrix.
With a matrix:
tmp2 =
0 1.0000
0.1000 1.0000
0.2000 1.0000
0.3000 1.0000
0.3000 2.0000
0.4000 2.0000
0.5000 2.0000
0.6000 2.0000
0.7000 2.0000
0.7000 3.0000
0.8000 3.0000
0.9000 3.0000
1.0000 3.0000
1.1000 3.0000
1.2000 3.0000
I need to remove the rows:
0.3000 2.0000
0.7000 3.0000
I tried to do it with
[~,b] = unique(tmp2(:,1));
tmp2(b,:)
I wrote something on my own
tmp3 = [];
for i=1:numel(tmp2(:,1))-1
if tmp2(i,1) == tmp3
tmp2(i,:) = [];
end
tmp3 = tmp2(i,1);
end
But all of the methods seem to omit the first row to remove... Please help, as I already spent some hours trying to fix it myself (I suck at programming...) and nothings seems to work. The matrix is an example, but generally if two rows have the same value in the first column I have to remove the second one

You were on the right track...
tmp2 = [...
0 1
1 1
2 3
2 5
3 5
4 7
5 4
5 8
6 1
];
Now call unique like you did, but use the flag first to grab the first unique:
[~,li]=unique(tmp2(:,1),'first');
tmp_unique = tmp2(li,:);

Related

Matrix being divided by a value unexpectedly

Just working on multiply 2 rows of a matrix and making the answer another row within the same matrix. However when putting the new row in the matrix all of the values are for some reason divided by 1.0e+03.
X = (M(1,1:5) .* M(2,1:5));
M(3,1:5) = X(1,1:5)
disp(X)
Actual Result
X =
1.0e+03 *
0.4000 0.5500 0.7000 0.5000 0.6000
0.0030 0.0005 0.0008 0.0015 0.0050
1.2000 0.2750 0.5250 0.7500 3.0000
Expected Result
400.0000 550.0000 700.0000 500.0000 600.0000
3.0000 0.5000 0.7500 1.5000 5.0000
1200 275 525 750 3000

Matlab - Plot from double data

From an N x 3 matrix with values in third column only having the 4 values 1, 2, 3 & 4, I have to create a bar plot as well as a line plot showing the growth rate for each row (second column values). First row values are called 'Temperature', second 'Growth rate' and third 'Bacteria type'
As of right now my line plot does not work when removing rows with one of the four values in the third column. The matrix could look something like this
39.1220 0.8102 1.0000
13.5340 0.5742 1.0000
56.1370 0.2052 1.0000
50.0190 0.4754 1.0000
24.2970 0.8615 1.0000
37.1830 0.8513 1.0000
59.2390 0.0584 1.0000
45.7840 0.6254 1.0000
51.9480 0.3932 1.0000
42.3400 0.7371 1.0000
25.3870 0.8774 1.0000
57.1870 0.3880 2.0000
37.4580 0.7095 2.0000
46.4190 0.6431 2.0000
38.8380 0.7034 2.0000
11.2930 0.1214 2.0000
32.3270 0.6708 2.0000
42.3150 0.6908 2.0000
36.0600 0.7049 2.0000
28.6160 0.6248 2.0000
56.8570 0.3940 2.0000
51.4770 0.5410 2.0000
52.4540 0.5127 2.0000
28.6270 0.6248 2.0000
39.6590 0.7021 2.0000
53.6280 0.4829 2.0000
56.6750 0.4029 2.0000
43.4230 0.6805 2.0000
20.3390 0.4276 2.0000
42.6930 0.6826 2.0000
13.6030 0.2060 2.0000
30.3360 0.6497 2.0000
43.3470 0.6749 2.0000
56.6860 0.3977 2.0000
50.5480 0.5591 2.0000
34.2270 0.6929 2.0000
47.8370 0.6136 2.0000
30.8520 0.6593 2.0000
51.3290 0.5050 3.0000
29.5010 0.7789 3.0000
34.8950 0.8050 3.0000
44.7400 0.6884 3.0000
51.7180 0.4927 3.0000
40.4810 0.7621 3.0000
38.7370 0.7834 3.0000
26.3020 0.7379 3.0000
32.8210 0.8072 3.0000
45.6900 0.6684 3.0000
54.2200 0.4058 3.0000
46.0430 0.6611 3.0000
10.9310 0.2747 3.0000
43.7390 0.7043 3.0000
31.9250 0.7948 3.0000
31.8910 0.7954 3.0000
15.8520 0.4592 3.0000
50.7340 0.5237 3.0000
26.2430 0.7305 3.0000
22.3110 0.6536 3.0000
14.7690 0.1796 4.0000
17.3260 0.2304 4.0000
41.5570 0.3898 4.0000
52.9660 0.2604 4.0000
58.7110 0.1558 4.0000
And my code is as follows, with data being the matrix (double)
function dataPlot(data)
%1xN matrix with bacteria
A=data(:,3);
%Number of different bacterias is counted, and gathered in a vector
barData = [sum(A(:) == 1), sum(A(:) == 2), sum(A(:) == 3), sum(A(:) == 4)];
figure
bar(barData);
label = {'Salmonella enterica'; 'Bacillus cereus'; 'Listeria'; 'Brochothrix thermosphacta'};
set(gca,'xtick',[1:4],'xticklabel',label)
set(gca,'XTickLabelRotation',45)
ylabel('Observations')
title('Destribution of bacteria')
%The data is divided into four matrices based on the four different bacterias
%Salmonella matrix
S=data;
deleterow = false(size(S, 1), 1);
for n = 1:size(S, 1)
%For column condition
if S(n, 3)~= 1
%Mark line for deletion afterwards
deleterow(n) = true;
end
end
S(deleterow,:) = [];
S=S(:,1:2);
S=sortrows(S,1);
%Bacillus cereus
Ba=data;
deleterow = false(size(Ba, 1), 1);
for p = 1:size(Ba, 1)
%For column condition
if Ba(p, 3)~= 2
%Mark line for deletion afterwards
deleterow(p) = true;
end
end
Ba(deleterow,:) = [];
Ba=Ba(:,1:2);
Ba=sortrows(Ba,1);
%Listeria
L=data;
deleterow = false(size(L, 1), 1);
for v = 1:size(L, 1)
%For column condition
if L(v, 3)~= 3
%Mark line for deletion afterwards
deleterow(v) = true;
end
end
L(deleterow,:) = [];
L=L(:,1:2);
L=sortrows(L,1);
%Brochothrix thermosphacta
Br=data;
deleterow = false(size(Br, 1), 1);
for q = 1:size(Br, 1)
%For column condition
if Br(q, 3)~= 3
%Mark line for deletion afterwards
deleterow(q) = true;
end
end
Br(deleterow,:) = [];
Br=Br(:,1:2);
Br=sortrows(Br,1);
%The data is plotted (growth rate against temperature)
figure
plot(S(:,1), S(:, 2), Ba(:,1), Ba(:, 2), L(:,1), L(:, 2), Br(:,1), Br(:, 2))
xlim([10 60])
ylim([0; Inf])
xlabel('Temperature')
ylabel('Growth rate')
title('Growth rate as a function of temperature')
legend('Salmonella enterica','Bacillus cereus','Listeria','Brochothrix thermosphacta')
Can anyone help me fix it so when I have a matrix without eg 2 in the third column, it will still plot correctly?
I do know how to filter it correctly, and apply that filtering to 'data', so the only problem is the error codes occurring when plotting.
The errors are;
Warning: Ignoring extra legend entries.
> In legend>set_children_and_strings (line 643)
In legend>make_legend (line 328)
In legend (line 254)
In dataPlot (line 82)
In Hovedscript (line 153)
When running this function from a main script with the matrix sorted by growth rate (second column) and filtering for only third row values 1, 3 and 4 to be analyzed. The filtering is done through another function, done in advance, and the new 'data' looks like this.
59.2390 0.0584 1.0000
58.7110 0.1558 4.0000
14.7690 0.1796 4.0000
56.1370 0.2052 1.0000
17.3260 0.2304 4.0000
52.9660 0.2604 4.0000
10.9310 0.2747 3.0000
41.5570 0.3898 4.0000
51.9480 0.3932 1.0000
54.2200 0.4058 3.0000
15.8520 0.4592 3.0000
50.0190 0.4754 1.0000
51.7180 0.4927 3.0000
51.3290 0.5050 3.0000
50.7340 0.5237 3.0000
13.5340 0.5742 1.0000
45.7840 0.6254 1.0000
22.3110 0.6536 3.0000
46.0430 0.6611 3.0000
45.6900 0.6684 3.0000
44.7400 0.6884 3.0000
43.7390 0.7043 3.0000
26.2430 0.7305 3.0000
42.3400 0.7371 1.0000
26.3020 0.7379 3.0000
40.4810 0.7621 3.0000
29.5010 0.7789 3.0000
38.7370 0.7834 3.0000
31.9250 0.7948 3.0000
31.8910 0.7954 3.0000
34.8950 0.8050 3.0000
32.8210 0.8072 3.0000
39.1220 0.8102 1.0000
37.1830 0.8513 1.0000
24.2970 0.8615 1.0000
25.3870 0.8774 1.0000
Again, the bar plot works just fine, but does show all 4 bacteria even when only 3 of them are used, and the problem is in the line plot, with one line not showing in the plot.
Thank you for your time
A solution is to replace the following lines
plot(S(:,1), S(:, 2), Ba(:,1), Ba(:, 2), L(:,1), L(:, 2), Br(:,1), Br(:, 2))
legend('Salmonella enterica','Bacillus cereus','Listeria','Brochothrix thermosphacta')
by
hold on
plot(S(:,1), S(:, 2), 'DisplayName', 'Salmonella enterica');
plot(Ba(:,1), Ba(:, 2), 'DisplayName', 'Bacillus cereus');
plot(L(:,1), L(:, 2), 'DisplayName', 'Listeria');
plot(Br(:,1), Br(:, 2), 'DisplayName', 'Brochothrix thermosphacta');
legend SHOW;
In this way, the legend entries are explitely assigned to a specific plot, which works even if some plots are empty.
Copy and Paste mistake
L == Br in the provided code due to a copy and paste mistake. You should change if Br(q, 3)~= 3 into if Br(q, 3)~= 4.
Result
If I use your second input data (without 2 in the third column), I get the following (without any error message):

Matlab: Removing duplicate interactions [duplicate]

This question already has answers here:
How can I find unique rows in a matrix, with no element order within each row?
(4 answers)
Closed 7 years ago.
I have a Protein-Protein interaction data of homo sapiens. The size of the matrix is <4850628x3>. The first two columns are proteins and the third is its confident score. The problem is half the rows are duplicate pairs
if protein A interacts with B, C, D. it is mentioned as
A B 0.8
A C 0.5
A D 0.6
B A 0.8
C A 0.5
D A 0.6
If you observe the confident score of A interacting with B and B interacting with A is 0.8
If I have a matrix of <4850628x3> half the rows are duplicate pairs. If I choose Unique(1,:) I might loose some data.
But I want <2425314x3> i.e without duplicate pairs. How can I do it efficiently?
Thanks
Naresh
Supposing that in your matrix you store each protein with a unique id.
(Eg: A=1, B=2, C=3...) your example matrix will be:
M =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
2.0000 1.0000 0.8000
3.0000 1.0000 0.5000
4.0000 1.0000 0.6000
You must first sort the two first columns row-wise so you will always have the protein pairs in the same order:
M2 = sort(M(:,1:2),2)
M2 =
1 2
1 3
1 4
1 2
1 3
1 4
Then use unique with the second parameter rows and keep the indexes of unique pairs:
[~, idx] = unique(M2, 'rows')
idx =
1
2
3
Finally filter your initial matrix to keep unly the unique pairs.
R = M(idx,:)
R =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
Et voilĂ !

Adding data in intervals in Matlab

Hi I have data in MATLAB like this:
F =
1.0000 1.0000
2.0000 1.0000
3.0000 1.0000
3.1416 9.0000
4.0000 1.0000
5.0000 1.0000
6.0000 1.0000
6.2832 9.0000
7.0000 1.0000
8.0000 1.0000
9.0000 1.0000
9.4248 9.0000
10.0000 1.0000
I am looking for a way to sum the data in specific intervals. Example if I want my sampling interval to be 1, then the end result should be:
F =
1.0000 1.0000
2.0000 1.0000
3.0000 10.0000
4.0000 1.0000
5.0000 1.0000
6.0000 10.0000
7.0000 1.0000
8.0000 1.0000
9.0000 10.0000
10.0000 1.0000
i.e data is accumulated in the second column based on sampling the first row. Is there a function in MATLAB to do this?
Yes by combining histc() and accumarray():
F =[1.0000 1.0000;...
2.0000 1.0000;...
3.0000 1.0000;...
3.1416 9.0000;...
4.0000 1.0000;...
5.0000 1.0000;...
6.0000 1.0000;...
6.2832 9.0000;...
7.0000 1.0000;...
8.0000 1.0000;...
9.0000 1.0000;...
9.4248 9.0000;...
10.0000 1.0000];
range=1:0.5:10;
[~,bin]=histc(F(:,1),range);
result= [range.' accumarray(bin,F(:,2),[])]
If you run the code keep in mind that I changed the sampling interval (range) to 0.5.
This code works for all sampling intervals just define your wanted interval as range.
Yes and that's a job for accumarray:
Use the values in column 1 of F to sum (default behavior of accumarray) the elements in the 2nd column.
For a given interval of size s (Thanks to Luis Mendo for that):
S = accumarray(round(F(:,1)/s),F(:,2),[]); %// or you can use "floor" instead of "round".
S =
1
1
10
1
1
10
1
1
10
1
So constructing the output by concatenation:
NewF = [unique(round(F(:,1)/s)) S]
NewF =
1 1
2 1
3 10
4 1
5 1
6 10
7 1
8 1
9 10
10 1
Yay!!

Indexing during assignment

Say I have this sample data
A =
1.0000 6.0000 180.0000 12.0000
1.0000 5.9200 190.0000 11.0000
1.0000 5.5800 170.0000 12.0000
1.0000 5.9200 165.0000 10.0000
2.0000 5.0000 100.0000 6.0000
2.0000 5.5000 150.0000 8.0000
2.0000 5.4200 130.0000 7.0000
2.0000 5.7500 150.0000 9.0000
I wish to calculate the variance of each column, grouped by class (the first column).
I have this working with the following code, but it uses hard coded indices, requiring knowledge of the number of samples per class and they must be in specific order.
Is there a better way to do this?
variances = zeros(2,4);
variances = [1.0 var(A(1:4,2)), var(A(1:4,3)), var(A(1:4,4));
2.0 var(A(5:8,2)), var(A(5:8,3)), var(A(5:8,4))];
disp(variances);
1.0 3.5033e-02 1.2292e+02 9.1667e-01
2.0 9.7225e-02 5.5833e+02 1.6667e+00
Separate the class labels and the data into different variables.
cls = A(:, 1);
data = A(:, 2:end);
Get the list of class labels
labels = unique(cls);
Compute the variances
variances = zeros(length(labels), 3);
for i = 1:length(labels)
variances(i, :) = var(data(cls == labels(i), :)); % note the use of logical indexing
end
I've done a fair bit of this type of stuff over the years, but to be able to judge, better vs. best, it would help to know what you expect to change in the data set or structure.
Otherwise, if no change is anticipated and the hard code works, stick with it.
Easy, peasy. Use consolidator. It is on the file exchange.
A = [1.0000 6.0000 180.0000 12.0000
1.0000 5.9200 190.0000 11.0000
1.0000 5.5800 170.0000 12.0000
1.0000 5.9200 165.0000 10.0000
2.0000 5.0000 100.0000 6.0000
2.0000 5.5000 150.0000 8.0000
2.0000 5.4200 130.0000 7.0000
2.0000 5.7500 150.0000 9.0000];
[C1,var234] = consolidator(A(:,1),A(:,2:4),#var)
C1 =
1
2
var234 =
0.035033 122.92 0.91667
0.097225 558.33 1.6667
We can test the variances produced, since we know the grouping.
var(A(1:4,2:4))
ans =
0.035033 122.92 0.91667
var(A(5:8,2:4))
ans =
0.097225 558.33 1.6667
It is efficient too.