TensorFlow: GPU utilization is almost always at 0%

I'm using TensorFlow with Titan X GPUs, and I've noticed that when I run the CIFAR-10 example, the volatile GPU utilization is fairly constant at around 30%, whereas when I train my own model the volatile GPU utilization is far from steady: it is almost always 0%, spikes to 80-90%, and then drops back to 0%, over and over again.
I thought this behavior was due to the way I was feeding the data to the network (I was fetching the data after each step, which took some time). But after implementing a queue to feed the data and avoid this latency between steps, the problem persisted (see below for the queuing system).
Any ideas?
import threading
import tensorflow as tf

# n_steps, n_input, n_classes, data, train_op and max_iter are defined elsewhere
batch = 128  # size of the batch
x = tf.placeholder("float32", [None, n_steps, n_input])
y = tf.placeholder("float32", [None, n_classes])

# with a capacity of 100 batches, the bottleneck should not be the data feeding
queue = tf.RandomShuffleQueue(capacity=100*batch,
                              min_after_dequeue=80*batch,
                              dtypes=[tf.float32, tf.float32],
                              shapes=[[n_steps, n_input], [n_classes]])
enqueue_op = queue.enqueue_many([x, y])
X_batch, Y_batch = queue.dequeue_many(batch)

sess = tf.Session()

def load_and_enqueue(data):
    # runs on a background thread and feeds the queue forever
    while True:
        X, Y = data.get_next_batch(batch)
        sess.run(enqueue_op, feed_dict={x: X, y: Y})

# note the trailing comma: args must be a tuple, (data) is just data
train_thread = threading.Thread(target=load_and_enqueue, args=(data,))
train_thread.daemon = True
train_thread.start()

for _ in xrange(max_iter):
    sess.run(train_op)

After doing some experiments, I found the answer, so I'm posting it since it could be useful to someone else.
First, get_next_batch is approximately 15x slower than train_op (thanks to Eric Platon for pointing this out).
However, I thought that the queue was supposed to be fed up to capacity first, and that only then would training begin. Hence, I assumed that even if get_next_batch was far slower, the queue should hide this latency, in the beginning at least, since it would hold capacity examples and would need to fetch new data only after dropping to min_after_dequeue, which is lower than capacity; this should result in a somewhat steady GPU utilization.
But actually, training begins as soon as the queue reaches min_after_dequeue examples. The queue is dequeued to run train_op as soon as it reaches min_after_dequeue examples, and since feeding the queue is 15x slower than executing train_op, the number of elements in the queue drops below min_after_dequeue right after the first iteration of train_op, and train_op then has to wait for the queue to reach min_after_dequeue examples again.
When I force train_op to wait until the queue is fed up to capacity (with capacity = 100*batch) instead of starting automatically when it reaches min_after_dequeue (with min_after_dequeue = 80*batch), the GPU utilization is steady for about 10 seconds before going back to 0%, which is understandable since the queue drains down to min_after_dequeue examples in less than 10 seconds.
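For reference, here is a minimal sketch of the "wait until full" workaround described above, assuming the names (queue, sess, batch, train_op, max_iter) from the snippet in the question: poll the queue's size op and start training only once the queue holds capacity examples.

import time

queue_size_op = queue.size()

# block until the background thread has filled the queue to capacity
while sess.run(queue_size_op) < 100 * batch:
    time.sleep(1.0)

# only now start the training loop
for _ in xrange(max_iter):
    sess.run(train_op)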

Related

MATLAB parfor is very slow with operations on a large matrix

I am writing MATLAB code which does some operations on a large matrix. First I create three 3D arrays:
dw2 = 0.001;
W2 = [0:dw2:1];
dp = 0.001;
P1 = [dp:dp:1];
dI = 0.001;
I = [1:-dI:0];
[II,p1,ww2] = ndgrid(I,P1,W2);
Then my code basically does the following:
G = [0:0.1:10];
Y = zeros(length(G),1);
for i = 1:1:length(G)
    g = G(i);
    Y(i) = myfunction(II,p1,ww2,g);
end
This code takes about 100 s, with each iteration being nearly 10 s.
However, after I start parfor:
ProcessPool with properties:
    Connected: true
    NumWorkers: 48
    Cluster: local
    AttachedFiles: {}
    AutoAddClientPath: true
    IdleTimeout: 30 minutes (30 minutes remaining)
    SpmdEnabled: true
Then it seems to run forever. The maximum number of workers is 48. I've also tried 2, 5, and 10; all of these are slower than non-parallel computing. Is that because MATLAB copies II, p1, and ww2 48 times, and that causes the problem? Also, myfunction involves a lot of vectorization, and I have already optimized it. Will that lead to slow performance of parfor? Is there a way to utilize (some of) the 48 workers to speed up the code? Any comments are highly appreciated. I need to run millions of cases, so I really hope that I can utilize the 48 workers in some way.
It seems that you have large data and a lot of cores. It is likely that you are simply running out of memory, which is why things get so slow.
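As a rough back-of-the-envelope check, sketched in Python (array sizes inferred from the ranges in the question; MATLAB doubles are 8 bytes):

# I, P1 and W2 from the question have 1001, 1000 and 1001 elements
nI, nP1, nW2 = 1001, 1000, 1001
one_array = nI * nP1 * nW2 * 8    # one ndgrid output: ~8.0 GB
all_three = 3 * one_array         # II, p1 and ww2 together: ~24 GB
with_copies = 48 * all_three      # one copy per process worker: ~1.15 TB
print(one_array / 1e9, all_three / 1e9, with_copies / 1e12)

Even if the original 24 GB fits in RAM, 48 per-process copies cannot, so the workers end up swapping.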
I would suggest that you set up your workers to be threads, not separate processes.
You can do this with parpool('threads'). Your code must conform to some limitations; not all code can be run this way, see here.
In thread-based parallelism you have shared memory (arrays are not copied). In process-based parallelism you have 48 copies of MATLAB running on your computer at the same time, each needing its own copy of your data. The latter system was originally designed to work on a compute cluster, and was later retrofitted to work on a single machine with two or four cores. I don't think it was ever meant for 48 cores.
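The distinction is not MATLAB-specific; here is a tiny Python sketch of the same idea (a thread sees the parent's data, a separate process works on its own copy):

import threading
import multiprocessing

def bump(d):
    d[0] += 1

if __name__ == "__main__":
    data = [0]

    # a thread shares the parent's memory: the change is visible afterwards
    t = threading.Thread(target=bump, args=(data,))
    t.start(); t.join()
    print(data[0])  # 1

    # a child process gets its own copy: the parent's list is unchanged
    p = multiprocessing.Process(target=bump, args=(data,))
    p.start(); p.join()
    print(data[0])  # still 1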
If you cannot use threads with your code, configure your parallel pool to have fewer workers, for example parpool('local',8).
For more information, see this documentation page.

CPU utilization calculation

I've read in many places that a simple and decent way to get the % of CPU utilization is with this formula:
CPU utilization = 1 - p^n
where:
p - the fraction of time a process spends blocked (e.g. waiting for I/O)
n - the number of processes
But I can't find an explanation for it. It seems to have something to do with statistics, but I can't wrap my head around it.
My starting point is: if I have 2 processes with 50% wait time, then the formula yields 1 - 1/4 = 75% CPU utilization. But my broken logic begs the question: if one process is blocked on I/O and the other is swapped in to run while the first is blocked (whatever the burst is), then while one waits, the other runs, and their wait times overlap. Isn't that 100% CPU utilization? I think this holds only when the first half of each program is guaranteed to run without needing I/O.
The question is: how does that formula take into account every other possibility?
You need to think in terms of probabilities. If the probability of each process being blocked (waiting for I/O) is 0.5, then the probability of the CPU being idle is the probability of all of the processes being blocked at the same time. That is 0.5 * 0.5 = 0.25, and so the probability that the CPU is doing work is 1 - 0.25 = 0.75 = 75%.
CPU utilisation is given as 1 minus the probability of the CPU being in the idle state,
and the CPU remains in the idle state when all the processes loaded in main memory are blocked on I/O at the same time.
So if each of n processes waits 50% of the time, the probability that all of them are in the blocked (I/O) state at once is 0.5^n, giving a utilization of 1 - 0.5^n.
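A quick sanity check of the formula in plain Python (just evaluating 1 - p^n for the example above):

def cpu_utilization(p, n):
    # p: fraction of time each process is blocked on I/O, n: process count
    return 1 - p ** n

print(cpu_utilization(0.5, 2))  # 0.75 -> the 75% from the question
print(cpu_utilization(0.5, 4))  # 0.9375 -> more processes keep the CPU busier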

Why is the Matlab parfor scheduler leaving workers idle?

I have a fairly long-running parfor loop (say 100,000 iterations, where each iteration takes about a minute) that I'm running with 36 cores. I've noticed that near the end of the job, a large number of the cores go idle while a few finish what I think must be multiple iterations per worker. This leads to a lot of wasted computing time, waiting for one worker to finish several jobs while the others sit idle.
The following script shows the issue (using the File Exchange utility Par.m):
% Set up parallel pool
nLoop = 480;
p1 = gcp;

% Run a loop
pclock = Par(nLoop);
parfor iLoop = 1:nLoop
    Par.tic;
    pause(0.1);
    pclock(iLoop) = Par.toc;
end
stop(pclock);
plot(pclock);

% Process the timing info:
runs = [[pclock.Worker]' [pclock.ItStart]' [pclock.ItStop]'];
nRuns = arrayfun(@(x) sum(runs(:,1) == x), 1:max(runs(:,1)));
starts = nan(max(nRuns), p1.NumWorkers);
ends = nan(max(nRuns), p1.NumWorkers);
for iS = 1:p1.NumWorkers
    starts(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 2));
    ends(1:nRuns(iS), iS) = sort(runs(runs(:, 1) == iS, 3));
end
firstWorkerStops = min(max(ends));
badRuns = starts > firstWorkerStops;
nBadRuns = sum(sum(badRuns)) - (p1.NumWorkers-1);
fprintf('At least %d (%3.1f%%) iterations run inefficiently.\n', ...
    nBadRuns, nBadRuns/nLoop * 100);
The way I'm looking at it, every worker should be busy until the queue is empty, after which all workers sit idle. But here it looks like that's not happening: with 480 iterations, I'm getting between 6 and 20 iterations that start on a worker after a different worker has been sitting idle for a full cycle. This number appears to scale linearly with the number of loop iterations, coming in at around 2% of the total. With limited testing, this appears to be consistent across MATLAB 2016b and 2014b.
Is there any reason that this is the expected behavior, or is this simply a poorly written scheduler in the parfor implementation? If so, how can I structure this so I'm not sitting around with idle workers for so long?
I think this explains what you are observing.
If there are more iterations than workers, some workers perform more than one loop iteration; in this case, a worker might receive multiple iterations at once to reduce communication time. (From "When to Use parfor")
Towards the end of a loop, two workers may finish their iterations at around the same time. If there is only one group of iterations left to be assigned, then one worker will get them all, and the other will remain idle. So it sounds like expected behavior, probably because the underlying implementation tries to reduce the communication cost associated with a worker pool. I've looked around the web and the MATLAB settings, and it doesn't seem like there is a way to adjust the communication strategy.
The parfor scheduler attempts to load-balance for loops where the iterations do not take a uniform amount of time. Unfortunately, as you observe, this can lead to workers becoming idle at the end of the loop. With parfor, you don't have control over the division of work; but you could use parfeval to divide your work up into even chunks - that might give you better utilisation. Or, you could even use spmd in conjunction with a for-drange loop.
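To illustrate the even-chunks idea (not MATLAB's actual scheduler; a sketch in Python using concurrent.futures, with do_iteration as a hypothetical stand-in for one loop body): pre-split the work into one chunk per worker, so no worker is left holding a large final batch while the others sit idle.

from concurrent.futures import ProcessPoolExecutor

def do_iteration(i):
    return i * i  # hypothetical stand-in for one minute-long iteration

def run_chunk(indices):
    return [do_iteration(i) for i in indices]

if __name__ == "__main__":
    n_loop, n_workers = 480, 36
    # round-robin split: chunk k gets iterations k, k+36, k+72, ...
    chunks = [range(k, n_loop, n_workers) for k in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        per_chunk = pool.map(run_chunk, chunks)
    results = [r for chunk in per_chunk for r in chunk]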

When is SJF worse than FCFS?

In the operating systems of supercomputers, which handle a large number of tasks at the same time, is there any situation where the SJF policy takes longer than the FCFS policy, in terms of the waiting-time metric?
It can be assumed that more than one core is present in the system.
First I thought that it was not possible; then I took some time and finally arrived at this result:
Yes, it can be.
Suppose that the ready queue is filled with processes with equal burst times (all = x):

Process    Burst time
P1         x
P2         x
P3         x
P4         x
...        ...
Pn         x
Now, in this case, what FCFS does is allocate the CPU to whichever process arrived first, then to the next arrival, and so on, without wasting any time.
But what SJF does is first search the available jobs in the ready queue for the one with the shortest burst time, which in this case is a waste of time: all the burst times are equal, so SJF ends up traversing the ready queue without any fruitful result.
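A toy sketch of this argument in Python (equal burst times assumed, as above): both policies produce the same waiting times, but SJF pays an extra O(n) scan on every dispatch.

def fcfs(bursts):
    # dispatch in arrival order: just walk the queue front to back
    t, waits = 0, []
    for b in bursts:
        waits.append(t)
        t += b
    return sum(waits) / len(waits)

def sjf(bursts):
    # dispatch the shortest job: an O(n) scan of the ready queue per pick
    queue, t, waits = list(bursts), 0, []
    while queue:
        j = queue.index(min(queue))  # fruitless scan when all bursts are equal
        waits.append(t)
        t += queue.pop(j)
    return sum(waits) / len(waits)

bursts = [5] * 8  # n processes, equal burst time x = 5
print(fcfs(bursts), sjf(bursts))  # identical average waiting time: 17.5 17.5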

Interrupt time in DMA operation

I'm facing difficulty with the following question:
Consider a disk drive with the following specifications:
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle-stealing mode whereby whenever a 1-byte word is ready it is sent to memory; similarly, for writing, the disk interface reads a 4-byte word from memory in each DMA cycle. Memory cycle time is 40 ns. What is the maximum percentage of time that the CPU is blocked during DMA operation?
The solution to this question that I found online is:
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood up to (1).
Can anybody explain to me where 0.2621 comes from? How is the interrupt time calculated? Please help.
Reversing from the numbers you've given, that's 6553600 * 40 ns, which gives 0.2621 sec.
One quite obvious problem is that the comments in the calculations are somewhat wrong. It's not:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50, so the quantity is an amount of data, not tracks. How could that be called a 'number of tracks'? Reading a full track takes 1 full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
However, the total number of bytes readable in 1 s is then just 512K * 50, since 512K is the amount of data on one track.
But then it is further divided by 4..
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Each interrupt triggers one memory operation, so then:
total wasted: 6553600 * 40 ns = 0.2621 sec.
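Putting those corrected numbers in one place (a plain recomputation in Python, nothing assumed beyond the question's figures):

rps = 3000 / 60                    # 50 revolutions per second
bytes_per_track = 512 * 1024       # 512 sectors/track * 1 KB/sector = 2^19
words_per_track = bytes_per_track // 4        # the DMA moves 4-byte words
interrupts_per_sec = words_per_track * rps    # (2^19 / 2^2) * 50 = 6553600
cycle = 40e-9                      # memory cycle time: 40 ns
blocked = interrupts_per_sec * cycle          # ~0.2621 s out of every second
print(interrupts_per_sec, blocked)            # 6553600.0 0.262144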
However, I don't really like how the 'number of interrupts per second' is calculated. I currently don't see/feel/guess how or why it's just bytes/4.
The only VAGUE explanation of that "divide by 4" I can think of is:
Each byte written to the controller's memory triggers an event. However, the DMA controller can read only PACKETS of 4 bytes, so the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then does the DMA kick in and halt the bus (or part of it) for the duration of the one memory cycle needed to copy the data. While the bus is frozen, the processor MAY have to wait. It doesn't NEED to; it can keep doing its own ops and work from cache, but if it tries touching memory, it will need to wait until the DMA finishes.
However, I don't like a few things in this "explanation", so I cannot guarantee you that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/bus are organized.
The only mistake is that it's not
no. of tracks read
It's actually the number of interrupts that occurred (the number of times the DMA came up with its data; that is how many times the CPU will be blocked).
But again, I don't know why it's multiplied by 50. Probably because of the 1-second window, but I wish to solve this without multiplying by 50.
My solution:
Here, in 1 rotation the interface can read 512 KB of data. 1 rotation time = 0.02 sec, so the preparation time for one byte of data = 38.1 ns, and for 4 B it takes 152.6 ns. Memory cycle time = 40 ns. So the % of time the CPU gets blocked = 40/(40+152.6) = 0.208 ~= 20%. But in the answer booklet the options are given as A) 10 B) 25 C) 40 D) 50. Tell me if I'm doing something wrong?
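For completeness, the arithmetic of that last approach, recomputed the same way:

rotation = 0.02                       # seconds per rotation at 3000 rpm
per_byte = rotation / (512 * 1024)    # ~38.1 ns to prepare one byte
per_word = 4 * per_byte               # ~152.6 ns to prepare a 4-byte word
cycle = 40e-9                         # memory cycle time: 40 ns
blocked = cycle / (cycle + per_word)  # ~0.208, i.e. about 20%
print(per_byte * 1e9, per_word * 1e9, blocked)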