How to speed up this MATLAB code in the loop?

I have a matrix with 1,200,000 rows and 18 columns of traffic data. Each row is unique; the first column is the vehicle ID, the second is the frame ID, and the 16th column holds the ID of the vehicle in front ("front vehicle ID"). For every row, I want MATLAB to find the row of the front vehicle within the same frame ID in the whole data and place it in another matrix called PV; if there is no car in front, a zero vector should be placed instead. The whole matrix is called H1. I used the code below (the percentage line just tracks progress), but its runtime is far too slow: it takes more than 14 hours on a machine with 16 GB of memory, which is too long since I have 10 other data sets like this one. Please help me make it faster and better.
Thanks in advance.
m = size(H1,1); % number of rows (1200000 here)
PV = zeros(m,17); % preallocate the output
for i=1:m
i*100/m % Shows what percent of the work is done
g = H1(H1(:,1)==H1(i,16),:); % rows belonging to the front vehicle
g = g(g(:,2)==H1(i,2),:); % restrict to the same frame ID
if ~isempty(g)
PV(i,1:17) = g(1,1:17); % rows with no front vehicle stay zero
end
end
EDIT: The data is like a book with 10,000 pages. Each page is a frame ID (the page number is the frame ID) and contains many cars, each with a unique vehicle ID. A page is thus an image taken from above with many cars inside it; attaching the images together at 0.1-second intervals gives the driving film of the vehicles. The data includes the x and y coordinates of the vehicles, so each frame can be drawn with the plot command in MATLAB. It also includes the preceding vehicle (the vehicle in front of the subject vehicle): its ID is stored in the 16th column of each row, and if there is no vehicle in front of the subject vehicle the number in the 16th column is zero. It is worth noting that the information of all vehicles is present in the data, each row describes exactly one vehicle, and the data is sorted by frame ID.
Now I need to extract the row of the preceding vehicle from the whole matrix and place it in the matrix PV. The problem is that progress slows to a crawl once it reaches about 5%. Here is a sample of the data:
[629 2033...581]
The first column is the vehicle ID, the second is the frame ID and the 16th is the preceding vehicle ID in this frame. Here vehicle 581 is in front of vehicle 629 in frame 2033, so I need to extract the data for vehicle 581 in frame 2033 and place it in the PV matrix.
More samples (again, the first number is the vehicle ID, the second is the frame ID, and the preceding vehicle ID is in the same column as described above):
[629 2033 688 1113433338200 28.703 462.09 6042802.932 2133529.776 56.3 7.9 3 12.8 5.09 3 581 640 95.39]
[577 2033 465 1113433338200 79.392 618.232 6042833.946 2133691.06 17.3 8.4 2 30.19 -0.37 7 0 3362 0]
[580 2033 621 1113433338200 53.4 542.455 6042817.601 2133612.708 18.3 7.5 2 20.49 -0.09 5 572 3361 80.9]
[581 2033 565 1113433338200 27.252 557.481 6042789.779 2133624.359 16.8 7.4 2 21.25 4.19 3 573 629 62.54]
Sorry for the long explanation and thanks for your help in advance.

With the help of others I found the answer:
We first need to split the data into one cell per frame ID and then apply the code frame by frame.
N = max(H1(:,2));
F = cell(N,1); % one cell per frame ID
for i=1:N
i*100/N % Shows what percent of the work is done
F{i} = H1(H1(:,2)==i,:);
end
F = F(~cellfun(@isempty, F)); % drop frame IDs that never occur
This code splits up the frames. Then the following is applied:
for j=1:numel(F) % iterate over the frames actually present
m = size(F{j},1);
for i=1:m
i*100/m % Shows what percent of the work is done
g = F{j}(F{j}(:,1)==F{j}(i,16),:); % preceding vehicle in this frame
if isempty(g)
F{j}(i,18:34) = zeros(1,17); % no preceding vehicle
else
F{j}(i,18:34) = g(1,1:17);
end
end
end
Thanks for the help, @hypfco and @m7913d.
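For reference, a loop-free sketch of the same lookup is also possible (my own suggestion, not part of the answers above), assuming vehicle IDs are positive integers smaller than 1e6 so that a (frame ID, vehicle ID) pair can be packed into a single numeric key:
key = H1(:,2)*1e6 + H1(:,1); % one key per (frame, vehicle) row
lookup = H1(:,2)*1e6 + H1(:,16); % key of each row's preceding vehicle (assumption: IDs < 1e6)
[found, loc] = ismember(lookup, key); % loc(i) = row index of the front car, 0 if none
PV = zeros(size(H1,1), 17);
PV(found,:) = H1(loc(found), 1:17); % rows without a front car keep their zero vector
A preceding-vehicle ID of 0 never matches a real key, so those rows stay zero automatically.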

Related

How to efficiently create an array by tracing back the parent nodes in Matlab?

I am working on a path-planner algorithm. I have an Nx2 matrix NodeInfo with the current node number in its 1st column and the parent node number in its 2nd column. For example:
NodeInfo = [3,1;
4,1;
5,2;
6,2;
7,3;
8,4;
9,4;
10,4;
11,5;
12,6;
13,6;
14,6;
15,7;
16,7;
17,8;
18,8;
19,9;
20,9;
21,10;
22,10;
23,11;
24,11;
25,12;
26,12];
When the algorithm reaches a goal, it outputs the node number, which is 26 in this case. I am looking for a smart way of tracing back the parent nodes and creating an array of the nodes that led to the goal node. So the output should be:
Array = [26, 12, 6, 2];
Thanks!
p = NodeInfo(end,1); % start from the goal node
parents = p;
while ~isempty(p)
p = NodeInfo(NodeInfo(:,1)==p, 2); % parent of p (empty at the root)
parents = [parents p];
end
The answer is stored in parents.
The code below uses a container. It may take some time to build the hashmap, but it can be faster than find() when there is a larger dataset with a vast number of requests.
Edit: Added 2 nodes to NodeMap to keep the while condition from needing a time-consuming isKey() check.
NodeMap = containers.Map(NodeInfo(:,1),NodeInfo(:,2)); %Create a container
NodeMap(1)=0; NodeMap(2)=0; %Add 2 nodes
nodes=zeros(1,length(NodeMap)); %pre-allocate
k=2; [N,nodes(1)]=deal(26); %Init parameters
while(N>0)
[nodes(k),N]=deal(NodeMap(N));
k=k+1;
end
nodes(nodes == 0)=[] %Cleaning up & print
The output of N=26 is:
nodes =
26 12 6 2
hope it helps!
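If the lookup has to run many times on a much larger tree, another option (my own sketch, assuming node numbers are positive integers) is a plain lookup vector indexed by node number, so that each back-step is a single O(1) array access:
parent = zeros(max(NodeInfo(:,1)), 1); % parent(child) = parent node, 0 if unknown
parent(NodeInfo(:,1)) = NodeInfo(:,2);
path = 26; % goal node
while parent(path(end)) > 0 % stop when no parent is recorded
path(end+1) = parent(path(end));
end
path % prints [26 12 6 2] for this example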

MATLAB is too slow calculating Spearman's rank correlation for 9-element vectors

I need to calculate Spearman's rank correlation (using the corr function) for pairs of vectors of different lengths (for example from 5-element to 20-element vectors). There are usually over 300 pairs for each length, and I track progress with waitbar. I have noticed that it takes an unusually long time for 9-element pairs of vectors, while for other lengths (greater and smaller) it is very fast. Since the formula is exactly the same, the problem must originate in the MATLAB corr function.
I wrote the following code to verify that the problem is with the corr function and not the other calculations I perform alongside it inside two or three nested for loops. The code repeats the timing 50 times to avoid accidental results.
The result is a bar graph confirming that MATLAB takes a long time to calculate Spearman's rank correlation for 9-element vectors. Since my calculations are not that heavy, this does not cause an endless wait; it just inflates the total time of the whole process. Can someone tell me what causes the problem and how to avoid it?
Times1 = zeros(20,50);
for i = 5:20
for j = 1:50
tic
A = rand(i,2);
[r,p] = corr(A(:,1),A(:,2),'type','Spearman');
Times1(i,j) = toc;
end
end
Times2 = mean(Times1,2);
bar(Times2);
xticks(1:25);
xlabel('number of elements in vectors');
ylabel('average time');
After some investigation, I think I found the root of this very interesting problem. My tests profile every outer iteration using the built-in MATLAB profiler, as follows:
res = cell(20,1);
for i = 5:20
profile clear;
profile on -history;
for j = 1:50
uni = rand(i,2);
corr(uni(:,1),uni(:,2),'type','Spearman');
end
profile off;
p = profile('info');
res{i} = p.FunctionTable;
end
The first thing I noticed is that the Spearman correlation for matrices with a number of rows less than or equal to 9 is computed in a different way than for matrices with 10 or more rows. For the former, the functions being internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'factorial' 100
'tiedrank>tr' 100
'tiedrank' 100
'corr>pvalSpearman' 50
'corr>rcumsum' 50
'perms>permsr' 50
'perms' 50
'corr>spearmanExactSub' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
For the latter, the functions being internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'tiedrank>tr' 100
'tiedrank' 100
'corr>AS89' 50
'corr>pvalSpearman' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
Since the computation of the Spearman correlation for matrices with 10 or more rows runs smoothly and quickly, and shows no evidence of performance bottlenecks, I decided not to lose time investigating it and focused on the main concern: the small matrices.
I tried to understand the difference between the execution time of the whole process for a matrix with 5 rows and for one with 9 rows (the one notably showing the worst performance). This is the code I used:
res5 = res{5,1};
res5_tt = [res5.TotalTime];
res5_tt_perc = ((res5_tt ./ sum(res5_tt)) .* 100).';
res9_tt = [res{9,1}.TotalTime];
res9_tt_perc = ((res9_tt ./ sum(res9_tt)) .* 100).';
res_diff = res9_tt_perc - res5_tt_perc;
[~,res_diff_sort] = sort(res_diff,'desc');
tab = [cellstr(char(res5.FunctionName)) num2cell([res5_tt_perc res9_tt_perc res_diff])];
tab = tab(res_diff_sort,:);
tab = cell2table(tab,'VariableNames',{'Function' 'TT_M5' 'TT_M9' 'DIFF'});
And here is the result:
Function TT_M5 TT_M9 DIFF
_______________________ _________________ __________________ __________________
'corr>spearmanExactSub' 7.14799963478685 16.2879721171023 9.1399724823154
'corr>pvalSpearman' 7.98185309750143 16.3043118970503 8.32245879954885
'perms>permsr' 3.47311716905926 8.73599255035966 5.26287538130039
'perms' 4.58132952553723 8.77488502392486 4.19355549838763
'corr>corrSpearman' 15.629476293326 16.440893059217 0.811416765890929
'corr>rcumsum' 0.510550019981949 0.0152486312660671 -0.495301388715882
'factorial' 0.669357868472376 0.0163923929871943 -0.652965475485182
'parseArgs' 1.54242684137027 0.0309456171268161 -1.51148122424345
'tiedrank>tr' 2.37642998160463 0.041010720272735 -2.3354192613319
'parseArgs' 2.4288171135289 0.0486075856244615 -2.38020952790444
'corr>corrPearson' 2.49766877262937 0.0484657591710417 -2.44920301345833
'tiedrank' 3.16762535118088 0.0543584195582888 -3.11326693162259
'corr' 21.8214856092549 16.5664346332513 -5.25505097600355
Once the bottleneck was detected, I started analyzing the internal code (open corr) and finally found the cause of the problem. Within spearmanExactSub, this piece of code is executed (where n is the number of rows of the matrix):
n = arg1;
nfact = factorial(n);
Dperm = sum((repmat(1:n,nfact,1) - perms(1:n)).^2, 2);
All permutations of a vector whose values range from 1 to n are being computed: perms(1:n) returns an nfact-by-n matrix, one row per permutation. This is what drives up the computational complexity (and, obviously, the computation time) of the function. Other operations, like the subsequent repmat that tiles 1:n into nfact rows, and the ones below that point, contribute to worsening the situation. Now, long story short...
factorial(5) = 120
factorial(6) = 720
factorial(7) = 5040
factorial(8) = 40320
factorial(9) = 362880
Can you see now why, between 5 and 9 elements, your bar graph shows an "exponentially" increasing computation time?
On a side note, there is nothing you can do to solve this problem unless you find another implementation of the Spearman correlation that doesn't present the same bottleneck, or you implement your own.
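That said, if only the correlation coefficient is needed and not the small-sample exact p-value (an assumption on my part), a possible workaround is to compute the rho yourself as a Pearson correlation on ranks, which never enters the perms-based branch:
x = rand(9,1);
y = rand(9,1);
rs = corr(tiedrank(x), tiedrank(y)); % Pearson on ranks equals Spearman's rho
This skips pvalSpearman entirely, so the n = 9 case costs no more than any other length.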

How can I incorporate a for loop into my genetic algorithm?

I'm writing a genetic algorithm that attempts to find an optimized solution over the course of 100 generations. As it stands, my code produces 2 generations. I'm trying to add a for loop so the process repeats for the full 100 generations.
clc,clear
format shorte
k=80;
mu=50;
s=.05;
c1=k+(4/3)*mu;
c2=k-(2/3)*mu;
for index=1:50 %6 traits generated at random 50 times
a=.005*rand-.0025;
b=.005*rand-.0025;
c=.005*rand-.0025;
d=.005*rand-.0025;
e=.005*rand-.0025;
f=.005*rand-.0025;
E=[c1,c2,c2,0,0,0;
c2,c1,c2,0,0,0;
c2,c2,c1,0,0,0;
0,0,0,mu,0,0;
0,0,0,0,mu,0;
0,0,0,0,0,mu];
S=[a;d;f;2*b;2*c;2*e];
G=E*S;
g=G(1);
h=G(2);
i=G(3);
j=G(4);
k=G(5);
l=G(6);
F=[(g-h)^2+(h-i)^2+(i-g)^2+6*(j^2+k^2+l^2)];
PI=((F-(2*s^2))/(2*s^2))^2; %cost function, fitness assessed
RP(index,:)=[a,b,c,d,e,f,PI]; %initial random population
end
Gen1=sortrows(RP,7,{'descend'}); %the initial population ranked
%for loop 1:100 would start here
children=zeros(10,6); %10 new children created from the top 20 parents
babysitter=1;
for parent=1:2:20
theta=rand(1);
traita=theta*Gen1(parent,1)+(1-theta)*Gen1(1+parent,1);
theta=rand(1);
traitb=theta*Gen1(parent,2)+(1-theta)*Gen1(1+parent,2);
theta=rand(1);
traitc=theta*Gen1(parent,3)+(1-theta)*Gen1(1+parent,3);
theta=rand(1);
traitd=theta*Gen1(parent,4)+(1-theta)*Gen1(1+parent,4);
theta=rand(1);
traite=theta*Gen1(parent,5)+(1-theta)*Gen1(1+parent,5);
theta=rand(1);
traitf=theta*Gen1(parent,6)+(1-theta)*Gen1(1+parent,6);
children(babysitter,:)=[traita,traitb,traitc,traitd,traite,traitf];
babysitter=babysitter+1;
end
top10parents=Gen1(1:10,1:6);
Gen1([11:50],:)=[]; %bottom 40 parents removed
for newindex=1:30 %6 new traits generated randomly 30 times
newa=.005*rand-.0025;
newb=.005*rand-.0025;
newc=.005*rand-.0025;
newd=.005*rand-.0025;
newe=.005*rand-.0025;
newf=.005*rand-.0025;
newgenes(newindex,:)=[newa,newb,newc,newd,newe,newf];
end
nextgen=[top10parents;children;newgenes]; %top 10 parents, the 10 new children, and the new 30 random traits added into one new matrix
for new50=1:50
newS=[nextgen(new50,1);nextgen(new50,4);nextgen(new50,6);2*nextgen(new50,2);2*nextgen(new50,3);2*nextgen(new50,5)];
newG=E*newS;
newg=newG(1);
newh=newG(2);
newi=newG(3);
newj=newG(4);
newk=newG(5);
newl=newG(6);
newF=[(newg-newh)^2+(newh-newi)^2+(newi-newg)^2+6*(newj^2+newk^2+newl^2)]; %von-Mises criterion
newPI=((newF-(2*s^2))/(2*s^2))^2; %fitness assessed for new generation
PIcolumn(new50,:)=[newPI];
end
nextgenwPI=[nextgen,PIcolumn]; %pi column added to nextgen matrix
Gen2=sortrows(nextgenwPI,7,{'descend'}) %generation 2 ranked
So my question is: how can I get the generations to count themselves so that the for loop works? I've searched for an answer and read that having matrices count themselves is not a good idea, but I'm not sure what else to do besides creating a genN matrix that counts upward in increments of 1 after the first generation. Any suggestions?
Thank you
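There is no need to make the generations count themselves. One sketch of the missing loop (makeChildren, randomGenes and evalPI are hypothetical helper names for the blocks already written above, not existing functions) reuses a single Gen variable that is overwritten every generation:
Gen = sortrows(RP,7,{'descend'}); % generation 1, ranked
for gen = 2:100
children = makeChildren(Gen); % 10 children from the top 20 parents, as above
newgenes = randomGenes(30); % 30 fresh random individuals, as above
nextgen = [Gen(1:10,1:6); children; newgenes];
PIcolumn = evalPI(nextgen, E, s); % fitness column, computed as above
Gen = sortrows([nextgen, PIcolumn],7,{'descend'});
end
Since each iteration only needs the previous ranked population, nothing has to be numbered Gen1, Gen2, and so on.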

Detecting if values are within range of each other and taking a midpoint - MATLAB

Following on from: Detecting if any values are within a certain value of each other - MATLAB
I am currently using randi to generate a random number from which I then subtract and add a second number - generated using poissrnd:
for k=1:10
a = poissrnd(200,1);
b(k,1) = randi([1,20000]);
c(k,1:2) = [b(k,1)-a,b(k,1)+a];
end
c = sort(c);
c provides an output in this format:
823 1281
5260 5676
5372 5760
5379 5779
6808 7244
6869 7293
9203 9653
12197 12563
14411 14765
15302 15670
These are essentially the boundaries +/- a around the points chosen in b.
I then want to set an additional variable (e.g. d = 2000) which is used as the threshold by which values are matched and then merged. The boundaries are taken into consideration for this; the output for the above values when d = 2000 would be:
1052
7456
13933
The boundaries 823-1281 are not within 2000 of any other value, so their midpoint is taken, reflecting the original value. The next midpoint is taken between 5260 and 9653 because, going along the list, each successive value is within 2000 of the one before it until 9653. The same logic is then applied to take the midpoint between 12197 and 15670.
Is there a quick and easy way to adapt the answer give in the linked question to deal with a 2 column format?
EDIT (in order to make it clearer):
The values held in c can be thought of as demarcating the boundaries of 'blocks' that sit on a line. Every single boundary is checked to see if anything lies within 2000 of it (the black lines in my figure).
As soon as any black line touches a red block, that entire red block is incorporated into the same merge block, in full. This is why the first midpoint value calculated is 1052: nothing is touched by the two black lines emanating from the first two boundaries. The next set of blocks, however, all touch one another, which incorporates them all into one merge, so the midpoint is taken between 5260 and 9653, giving 7456.
The block starting at 12197 is out of reach of its preceding one, so it remains separate. I've not shown all the blocks.
EDIT 2 #Esteban:
b =
849
1975
8336
9599
12057
12983
13193
13736
16887
18578
c =
662 1036
1764 2186
8148 8524
9386 9812
11843 12271
12809 13157
12995 13391
13543 13929
16687 17087
18361 18795
Your script then produces the result:
8980
12886
17741
When in fact it should be:
1424
8980
12886
17741
So it is just missing the first value: if no merge occurs, the midpoint should simply be taken between the two boundaries. Sometimes this seems to work; other times it doesn't.
For example here it works (when value is set to 1000 instead of 2000 as a test):
c =
2333 2789
5595 6023
6236 6664
10332 10754
11425 11865
12506 12926
12678 13114
15105 15517
15425 15797
19490 19874
result =
2561
6129
11723
15451
19682
See if this works for you -
th = 2000 %// threshold
%// Column arrays
col1 = c(:,1)
col2 = c(:,2)
%// Position of "group" shifts
grp_changes = diff([col2(1:end-1,:) col1(2:end,:)],[],2)>th
%// Start and stop positions of shifts
stops = [grp_changes ; 1]
starts = [1 ; stops(1:end-1)]
%// Finally the mean of shift positions, which is the desired output
out = floor(mean([col1(starts~=0) col2(stops~=0)],2))
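As a sanity check, tracing this through the data from EDIT 2 with th = 2000 selects the boundary pairs (662,2186), (8148,9812), (11843,13929) and (16687,18795), whose floored midpoints are 1424, 8980, 12886 and 17741, matching the expected output, including the first, unmerged group.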
Not 100% sure it will work for all your samples, but this is the code I came up with, which works at least with the data in your example:
value=2000;
indices = find(abs(c(2:end,1)-c(1:end-1,2))>value);
indices = vertcat(indices, length(c));
li = indices(1:end-1)+1;
ri = indices(2:end);
if li(1)==2
li=vertcat(1,li);
ri=vertcat(1,ri);
end
result = floor((c(ri,2)+c(li,1))/2)
It's not very clean and could surely be done in fewer lines, but it's easy to understand and it works; and since your c will be small, I don't see the need to optimize further unless you will run it millions of times.

Vectorizing set operations for string-valued cell arrays in MATLAB/Octave

I have a large data set, X, comprising demographic information of survey respondents. The data is largely categorical, so each row in X contains a bunch of string-valued features such as gender, race, interests, etc for a single respondent. Each column of X is a single response category. I have loaded this data set into a big cell array in MATLAB/Octave (testing on both). I would like to measure the Jaccard distance between each sample and every other sample in the data set. Basically what I want to do is this:
dist = zeros(size(X,1)); % Initialize my distance matrix
for ii = 1:size(X,1)
for jj = ii:size(X,1) % Only need the upper triangle since dist is symmetric
% Find the Jaccard distance between the ii-th and jj-th respondent
dist(ii,jj) = 1 - numel(intersect(X(ii,:), X(jj,:))) / numel(union(X(ii,:), X(jj,:)));
end
end
Except obviously I want to vectorize the code. I have tried using cellfun and bsxfun to vectorize, but when I do something like:
res = cellfun(@intersect, X, X, 'UniformOutput', false);
I get a cell array the same size as X, wherein the (i,j)-element is equivalent to intersect(X(i,j), X(i,j)); basically the unique characters in the (i,j)-cell. This does not help me. When I try:
res = bsxfun(@intersect, X, X);
I get one long cell array containing (I think) all of the unique values that any cell in X takes. This does not help me either.
I would like a solution that enables me to vectorize the code at the beginning of this discussion. If it is easier to do, code that finds the subset of X with the minimum (or maximum) Jaccard distance from any one row of X would also be exactly what I need.
Thanks in advance!
EDIT: Changed the loop code to only calculate the upper triangle of dist. It still takes far too long, and the fact that it is non-vectorized bugs me on a philosophical level.
EDIT: The first element of X, given by typing X(1,:) is:
ans =
{
[1,1] = Non - U.S. Citizen
[1,2] = Denied
[1,3] = M
[1,4] = CHINA
[1,5] = Full Time
[1,6] = D-Asian American or Pacific Islander
[1,7] =
[1,8] =
[1,9] = MSME
[1,10] =
}
This is just testing data for developing the algorithm while I wait on my actual survey results, but the survey results will have a similar form.
EDIT: More data from X, but in CSV form, is as follows:
Non - U.S. Citizen,Denied,M,INDIA,Full Time,E-Other,,,MSME,
Non - U.S. Citizen,Denied,F,INDIA,Full Time,D-Asian American or Pacific Islander,,,MSME,DESIGN
Non - U.S. Citizen,Denied,M,INDIA,Full Time,E-Other,,,MS,
Non - U.S. Citizen,Denied,M,IRAN,Full Time,B-Caucasian American Non-Hispanic,,,PhD,NANO
Non - U.S. Citizen,Left Without Degree,M,JORDAN,Full Time,E-Other,,,,
Non - U.S. Citizen,Denied,F,IRAN,Full Time,E-Other,,,PhD,BIOENG
,Not Attending,M,,Full Time,,,,PhD,
Non - U.S. Citizen,Not Attending,F,IRAN,Full Time,I-International Student,,,PhD,
Non - U.S. Citizen,Denied,M,BANGLADESH,Full Time,E-Other,,,PhD,NANO
Non - U.S. Citizen,Denied,M,BANGLADESH,Full Time,E-Other,,,MS,
This might be a workaround; I'll illustrate it on a single row of data:
a={'Non - U.S. Citizen','Denied','M','INDIA','Full Time','E-Other','','','MSME',''}
Sum each cell element; this casts the strings to doubles and sums their values. It will work assuming the odds of a non-unique sum are slim (if not, there's a trick you can implement, but I doubt that will actually happen):
b=cellfun(@sum,a,'un',0)
Now you have a single number per cell element; you can use cell2mat to get a matrix and then pdist, etc.
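Building on the same idea, here is a collision-free sketch (my own; note that it computes a column-aligned variant of the Jaccard distance, counting matches per response category rather than intersect/union over all strings). It gives each string an integer code per column and compares all rows at once:
[n, m] = size(X);
codes = zeros(n, m);
for col = 1:m
[~, ~, codes(:,col)] = unique(X(:,col)); % integer code per category; '' is a category too
end
matches = zeros(n);
for col = 1:m
matches = matches + (codes(:,col) == codes(:,col).'); % all-pairs comparison per column
end
dist = 1 - matches/m; % Jaccard-like distance matrix
The elementwise comparison relies on implicit expansion (MATLAB R2016b+, or Octave broadcasting); on older MATLAB releases, bsxfun(@eq, codes(:,col), codes(:,col).') does the same.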