I'm using MATLAB R2016a on Windows 10 64-bit. I'm running a program which is a fairly complicated simulation of an engineering problem.
The issue is that I use parfor, and there are two other for loops in this program. I've been careful to use as few for loops as possible, using arrays smartly and built-in commands such as repmat and bsxfun to avoid loops. When I run the program it proceeds nicely for quite a while and stores results, but suddenly, after some iterations, I encounter this error:
"All workers aborted during execution of the parfor loop."
and the program terminates. I'm using a powerful system with these specs:
Intel Core i7-4720HQ CPU, 16 GB DDR4 RAM, 8 MB cache, GPU: GeForce GTX 970M.
An example is given below (although the main program is much more demanding from both a memory and a computational point of view; I've omitted many lines, and three functions are called that are not included here):
lambda = 5e-5;
tau = 10.^((-5:25)*0.1);
% tau = 0.3; % scalar test value; uncommenting this would override the vector above
eta = 1.5;
b = 0.3;
c = 0.4;
beta_m = (0:90)';
x = (0.01:1000+0.01)';
r = (80.21:800+80.21)';
h = (10:0.1:30.5)';
Lh = length(h);
Lr = length(r);
Lx = length(x);
N = 6;
binom_coeff = factorial(N)*ones(N,1)./(factorial((1:N)').*factorial((N-(1:N))')); % nchoosek(N,k) for k = 1..N
pdf_x = 2*pi*x*lambda.*exp(-pi*lambda*x.^2);
pdf_R = 2*pi*lambda*r.*exp(-pi*lambda*r.^2);
theta_l = atan(repmat(h,1,Lr)./repmat(r',Lh,1))*180/pi;
ratio = sqrt(repmat(h,1,Lr)+repmat(r',Lh,1));
coverage = zeros(length(tau),length(beta_m));
Integrand_x = zeros(size(x));
Y = (b*h+c)*(1-a); % a, like G_l, G_0, v and h_0 below, is defined in the omitted code
for k = 1:length(beta_m)
    for thr = 1:length(tau)
        parfor i = 1:Lx
            temp = (-1)*eta*tau(thr)*(G_l/G_0.*(ratio/sqrt(x(i)^2+h_0^2)).^(-v));
            temp_N = repmat(temp,1,N).*reshape(repmat(1:N,size(temp,1)*size(temp,2),1),size(temp,1),size(temp,2)*N);
            Integrand = (1-(trapz(h,exp(temp_N).*repmat(Y,1,Lr*N))))';
            Integrand_x(i) = exp(trapz(r,(Integrand * binom_coeff)));
        end
        coverage(thr,k) = trapz(x,pdf_x.*Integrand_x);
    end
end
savepar = ['FinalMainRes_longheiv',num2str(v),'h0',num2str(h_0),'a',num2str(a),'.mat'];
save(savepar)
It's worth mentioning that running with just one worker does not crash (although it took about 4 days to complete the run).
What is the problem and how can I prevent it? Any help is appreciated.
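Since a single worker completes fine, one workaround I'm considering is capping the pool size so that each worker has more memory headroom; a minimal sketch (the worker count here is just an illustration, not a recommendation):

delete(gcp('nocreate'));  % shut down any existing pool
parpool('local', 2);      % open a pool with only two workers, leaving more RAM per worker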
Thanks in advance.
I am using MATLAB GPU computing to run a simulation. I suspect I may encounter a "random number seed" overlapping issue. My code is the following:
N = 10000;
v = rand(N,1);
p = [0:0.1:1];
pA = [0:0.1:2];
[v,p,pA] = ndgrid(v,p,pA);
v = gpuArray(v);
p = gpuArray(p);
pA = gpuArray(pA);
t = 1;
bH = 0.9;
bL = 0.6;
a = 0.5;
Y = MyFunction(v,p,pA,t,bH,bL,a);
function [RA] = MyFunction(v,p,pA,t,bH,bL,a)
    function [RA] = SSP1(v,p,pA)
        RA = 0;
        S1 = rand;
        S2 = rand;
        S3 = rand;
        vA1 = (S1<a)*bH + (S1>=a)*bL;
        vA2 = (S2<a)*bH + (S2>=a)*bL;
        vA3 = (S3<a)*bH + (S3>=a)*bL;
        if p<=t && pA>3*bL && pA<=3*bH
            if pA>vA1+vA2+vA3
                if v>=p
                    RA = p;
                end
            else
                if v+vA1+vA2+vA3>=p+pA
                    RA = p+pA;
                end
            end
        end
    end
    RA = gather(arrayfun(@SSP1,v,p,pA));
end
The idea of the code is the following:
I generate N random agents, each characterized by a value v. Then for each agent I have to compute a quantity given (p,pA). Since I have N agents and many combinations of (p,pA), I want to use the GPU to speed up the process. But here comes the tricky part:
for each agent, in order to finish the computation, I have to generate 3 extra random variables, vA1, vA2, vA3. Based on my understanding of the GPU (I could be wrong), it does these computations simultaneously, i.e., for each agent v it generates the 3 random variables vA1, vA2, vA3, and it runs these N procedures at the same time. However, I am not sure whether the vA1, vA2, vA3 drawn for agent 1 and agent 2 may overlap, because N could be 1 million. I want to make sure that the random number streams used to generate vA1, vA2, vA3 do not overlap across agents; otherwise I am in big trouble.
There is a way to prevent this from happening: I first generate 3N of these random variables vA1, vA2, vA3, and then put them onto the GPU. However, that may require a lot of GPU memory, which I don't have. The current method, I guess, does not need much GPU memory, as I am generating vA1, vA2, vA3 on the fly.
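For reference, the pre-generation approach I describe would look roughly like this (a minimal sketch; the only assumption beyond the code above is rand's 'gpuArray' type argument):

S = rand(N, 3, 'gpuArray');      % all 3N uniform draws at once, directly on the GPU
vA = (S < a)*bH + (S >= a)*bL;   % each entry is bH or bL; row i belongs to agent i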
What you describe does not happen. The proof is that the following code snippet generates random values in hB.
A = ones(100,1);
dA = gpuArray(A);
hB = gather(arrayfun(@applyrand,dA));
function dB = applyrand(dA)
    r = rand;
    dB = dA*r;
end
That said, each of your random variables can only take two values, because with S1, S2 and S3 you are basically flipping a coin:
vA1 = (S1<0.5)*bH+(S1>=0.5)*bL;
so vA1 is either bH or bL (the two conditions are complementary), and the sum vA1+vA2+vA3 takes only four distinct values: 3*bL, 2*bL+bH, bL+2*bH, and 3*bH.
Maybe this lack of variability is what is making you think that you don't have much randomness; it's not very clear from the question.
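Separately, if reproducibility of the GPU draws matters to you, you can seed the GPU generator explicitly before calling arrayfun; a minimal sketch, assuming the Parallel Computing Toolbox's gpurng:

gpurng(0, 'Philox4x32-10');  % Philox is a counter-based generator suited to parallel use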
I am trying to make a portion of my code run faster in MATLAB, and I'd like to use parfor. When I try to, I get the following error about one of my variables, D_all:
"The PARFOR loop cannot run because of the way D_all is used".
Here is a sample of my code.
M = 161;
N = 24;
P = 161;
parfor n = 1:M*N*P
    [j,i,k] = ind2sub([N,M,P],n);
    r0 = Rw(n,1:3);
    R0 = repmat(r0,M*N*P,1);
    delta = sqrt(dXnd(i)^2 + dZnd(k)^2);
    d = R_prime - R0;
    inS = Rw_prime(find(sqrt(sum(d.^2,2))<0.8*delta),:);
    if isempty(inS)
        D_all(j,i,k,tj) = D_all(j,i,k,tj-1);
    else
        y0 = r0(2);
        inC = inS(find(inS(:,2)==y0),:);
        dw = sqrt(sum(d(find(sqrt(sum(d.^2,2))<0.8*delta & d(:,2)==0),:).^2,2));
        V_avg = sum(dw.^(-1).*inC(:,4))/sum(dw.^(-1));
        D_all(j,i,k,tj) = V_avg;
    end
end
I'm not very familiar with parallel computing, and I've looked at the guides online but don't really understand how to apply them to my situation. I guess I need to "slice" D_all, but I don't know how to do that.
EDIT: I think I understand that the major problem is that when using D_all I index both tj and tj-1.
EDIT 2: I don't show this above (it probably would have been helpful), but I defined D_all(:,:,:,1) = V_1;, where V_1 corresponds to a previous time step. I tried making multiple variables V_2, V_3, etc. for each step and replacing D_all(j,i,k,tj-1) with V_1(j,i,k). This still led to the same error I am seeing with D_all:
"Valid indices for D_all are restricted for PARFOR loops"
EDIT: I used the profiler as suggested, and it looks like Matlab was spending a significant amount of time dealing with symbols and solving the system of equations. So, I will change my question slightly: is there a faster way to implement this system of equations, perhaps one that does not involve declaring symbols?
function [As_rad, Ae_rad] = pos_to_angle(x_pos, y_pos)
% Converts given x and y coordinates into angles (using link lengths)
    Ls = 0.4064;  % in meters
    Le = 0.51435; % in meters
    x_offset = 0.0;
    y_offset = -0.65; % from computer running the robot
    syms angle_s angle_e
    x = x_pos + x_offset; % Account for offset of origins
    y = y_pos + y_offset; % between motor and workspace
    eqn1 = x == Ls*cos(angle_s) + Le*cos(angle_e); % Actual conversion
    eqn2 = y == Ls*sin(angle_s) + Le*sin(angle_e);
    sol1 = solve([eqn1, eqn2], [angle_s, angle_e]);
    As_rad_mat = sol1.angle_s;
    Ae_rad_mat = sol1.angle_e;
    if As_rad_mat(1) > Ae_rad_mat(1)
        As_rad = As_rad_mat(1);
        Ae_rad = Ae_rad_mat(1);
    else
        As_rad = As_rad_mat(2);
        Ae_rad = Ae_rad_mat(2);
    end
end
To be more specific, it looked like a function mupadmex (which I believe is associated with symbolic math) took up about 80% of the computing time. The above is just an example of how I solved systems of equations throughout the script.
Thanks everyone for the responses! I ended up using the profiler and found that the computation time was coming from solving the system of equations every time the program went through the loop. So I used a separate script to solve the system symbolically once, so that the main script only has to do arithmetic. It is running MUCH quicker now.
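For anyone who hits the same problem, the one-time symbolic solve looks roughly like this (a sketch; the matlabFunction usage and variable names are my own illustration, not the exact script):

% Done once, offline: solve the two-link equations symbolically.
syms angle_s angle_e x y
Ls = 0.4064; Le = 0.51435;
sol = solve([x == Ls*cos(angle_s) + Le*cos(angle_e), ...
             y == Ls*sin(angle_s) + Le*sin(angle_e)], [angle_s, angle_e]);
% Turn each solution branch into a fast numeric function handle.
f_s = matlabFunction(sol.angle_s, 'Vars', [x y]);
f_e = matlabFunction(sol.angle_e, 'Vars', [x y]);
% At runtime only plain arithmetic is evaluated, no syms:
As_rad_mat = f_s(0.3, -0.2);  % example coordinates
Ae_rad_mat = f_e(0.3, -0.2);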
I have to calculate the std and mean of a large data set with respect to quite a few models. The final loop block is nested four levels deep.
This is what it looks like:
count = 1;
alpha = 0.5;
%%% Below, each individual block is posterior'd and then the average is taken
c = 1;
for i = 1:numel(writers) % no. of writers
    for j = 1:numel(test_feats{i}) % no. of images
        for k = 1:numel(gmm) % no. of models
            for n = 1:size(test_feats{i}{j},1)
                [~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
                c = c + 1;
            end
            c = 1;
            index_kek = find(abs(scores-mean(scores)) > alpha*std(scores));
            avg = mean(scores(index_kek)); % using std instead of mean... because of reasons
            NLL(count) = avg;
            count = count + 1;
        end
        count = 1; % reset count
        NLL_scores{i}(j,:) = NLL;
    end
    fprintf('***score for model_%d done***\n', i)
end
It works and gives the desired result, but it takes 3 days to produce the final calculation, even on my i7 processor. While it runs, Task Manager tells me that only 20% of the CPU is being used, so I would rather put more load on the CPU to get the result faster.
Going by the official help, if I want to make the outermost loop a parfor while keeping the rest as normal for loops, all I have to do is use integer limits rather than function calls such as size or numel.
With these changes the above code becomes:
count = 1;
alpha = 0.5;
%%% Below, each individual block is posterior'd and then the average is taken
c = 1;
num_writers = numel(writers);
num_images = numel(test_feats{1});
num_models = numel(gmm);
num_feats = size(test_feats{1}{1},1);
parfor i = 1:num_writers % no. of writers
    for j = 1:num_images % no. of images
        for k = 1:num_models % no. of models
            for n = 1:num_feats
                [~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
                c = c + 1;
            end
            c = 1;
            index_kek = find(abs(scores-mean(scores)) > alpha*std(scores));
            avg = mean(scores(index_kek)); % using std instead of mean... because of reasons
            NLL(count) = avg;
            count = count + 1;
        end
        count = 1; % reset count
        NLL_scores{i}(j,:) = NLL;
    end
    fprintf('***score for model_%d done***\n', i)
end
Is this the optimal way to implement parfor in my case? Can it be improved or optimized further?
I couldn't test this in MATLAB just now, but it should be close to a working solution. It has a reduced number of loops and changes a few implementation details, though overall it might perform just as fast as (or even slower than) your earlier code.
If gmm and test_feats take up a lot of memory, then it is important that parfor is able to determine which pieces of data need to be delivered to which workers; the IDE should warn you if inefficient memory access is detected. This modification is especially useful if num_writers is much smaller than the number of cores in your CPU, or only slightly larger (e.g., 5 writers on 4 cores would take about as long as 8 writers).
[i_writer, i_image, i_model] = ndgrid(1:num_writers, 1:num_images, 1:num_models);
idx_combined = [i_writer(:) i_image(:) i_model(:)];
n_combined = size(idx_combined, 1);
NLL_scores = zeros(n_combined, 1);
parfor i_for = 1:n_combined
    i = idx_combined(i_for, 1);
    j = idx_combined(i_for, 2);
    k = idx_combined(i_for, 3);
    % pre-allocate
    scores = zeros(num_feats, 1);
    for i_feat = 1:num_feats
        [~, scores(i_feat)] = posterior(gmm{k}, test_feats{i}{j}(i_feat,:));
    end
    % "find" is redundant here and performs a bit slower, though the
    % difference might be insignificant
    index_kek = abs(scores - mean(scores)) > alpha * std(scores);
    NLL_scores(i_for) = mean(scores(index_kek));
end
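If you need the results in the original writer-by-image-by-model layout afterwards, the flat vector maps straight back, since ndgrid above makes the writer index vary fastest:

NLL_cube = reshape(NLL_scores, num_writers, num_images, num_models);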
I have a backwards recursion for a binomial tree. At each node an unknown C enters in such a way that at the starting node we get a formula, A(1,1), that depends upon C. The code is as follows:
A = sym(zeros(1,Steps));
B = zeros(1,Steps);
syms C; % The unknown that enters A at every node
tic
for t = Steps-1:-1:1
    % Values needed in A and B
    Lambda = 1-exp(-(1./S(t,1:t).^b).*h);
    Q = ((1./D(t))./(1-Lambda)-d)/(u-d);
    R = normcdf(a0+a1*Lambda);
    % The backward recursion for A and B
    A(1:t) = D(t)*C+D(t)*...
        (Q.*(1-Lambda).*A(1:t) ...
        + (1-Q).*(1-Lambda).*A(2:t+1));
    B(1:t) = Lambda.*(1-R)+D(t)*...
        (Q.*(1-Lambda).*B(1:t)...
        + (1-Q.*(1-Lambda).*B(2:t+1)));
end
C = solve(A(1,1)==sym(B(1,1)), C);
This code takes around 4 seconds when Steps = 104. If, however, we remove C and make A a regular double matrix, it only takes about 0.02 seconds. Using syms thus increases the calculation time by a factor of 200, which seems too much to me. Any suggestions for speeding this up?
I am using MATLAB 2013b on a 13-inch MacBook Air (spring 2013). In case it's relevant, here is the code that precedes the part above:
a0 = 0.9;
a1 = -3.2557;
b = 1.2594;
S0 = 18.57;
sigma = 0.6579;
h = 1/104;
T = 1;
Steps = T/h;
pl = T/h; % path length - number of steps to maturity
f = transpose(normrnd(0.04, 0.001, [1 pl]));
D = exp(-h*f); % discount values
u = exp(sigma*sqrt(h));
d = 1/u;
u_row = repmat(cumprod([1 u*ones(1,pl-1)]),pl,1);
d_row = cumprod(tril(d*ones(pl),-1)+triu(ones(pl)),1);
path = tril(u_row.*d_row);
S = S0*path;
Unless I'm missing something, there's no need to use symbolic math or an unknown variable here. Since A(1,1) is linear in C with zero constant term (A starts from zeros, and C only enters through the additive D(t)*C term), running the recursion with C = 1 yields the coefficient of C, and the actual value follows at the end from C = B(1,1)/A(1,1). Here's the full code with some other improvements:
rng(1); % Always seed your random number generator
a0 = 0.9;
a1 = -3.2557;
b = 1.2594;
S0 = 18.57;
sigma = 0.6579;
h = 1/104;
T = 1;
Steps = T/h;
pl = T/h;
f = 0.04+0.001*randn(pl,1);
D = exp(-h*f);
u = exp(sigma*sqrt(h));
d = 1/u;
u_row = repmat(cumprod([1 u*ones(1,pl-1)]),pl,1);
d_row = cumprod(tril(d*ones(pl),-1)+triu(ones(pl)),1);
pth = tril(u_row.*d_row);
S = S0*pth;
A = zeros(1,Steps);
B = zeros(1,Steps);
tic
for t = Steps-1:-1:1
    Lambda = 1-exp(-h./S(t,1:t).^b);
    Q = ((1./D(t))./(1-Lambda)-d)/(u-d);
    R = 0.5*erfc((-a0-a1*Lambda)/sqrt(2)); % Faster than normcdf
    % Backward recursion for A and B
    A = D(t)+D(t)*(Q.*(1-Lambda).*A(1:end-1) + ...
        (1-Q).*(1-Lambda).*A(2:end));
    B = Lambda.*(1-R)+D(t)*(Q.*(1-Lambda).*B(1:end-1) + ...
        (1-Q.*(1-Lambda).*B(2:end)));
end
C = B/A
toc
This takes about 0.005 seconds to run on my MacBook Pro. There are certainly other improvements you could make: many combinations of variables are used in multiple places (e.g., 1-Lambda or D(t)*(1-Lambda)) and could be calculated once, though MATLAB may optimize some of this itself. You can also try moving Lambda, Q, and R out of the loop, or at least calculating parts of them outside the loop and saving the results in arrays, as sketched below.
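A minimal sketch of that hoisting (it reuses S, D, and the constants from the code above; w is just a hypothetical helper name for the repeated product):

% Precompute Lambda and R for every (step, node) pair up front.
Lambda_all = 1 - exp(-h./S.^b);  % entries above the diagonal are never indexed
R_all = 0.5*erfc((-a0 - a1*Lambda_all)/sqrt(2));
A = zeros(1,Steps);
B = zeros(1,Steps);
for t = Steps-1:-1:1
    Lambda = Lambda_all(t,1:t);
    R = R_all(t,1:t);
    w = D(t)*(1 - Lambda);  % D(t)*(1-Lambda), reused in both recursions
    Q = ((1/D(t))./(1 - Lambda) - d)/(u - d);
    A = D(t) + w.*(Q.*A(1:end-1) + (1-Q).*A(2:end));
    B = Lambda.*(1-R) + D(t)*(Q.*(1-Lambda).*B(1:end-1) + ...
        (1 - Q.*(1-Lambda).*B(2:end)));
end
C = B/A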