How to run multi-queue code using OpenCL?

For example, I'm doing some image processing on every frame of a video.
Processing one frame takes 200 ms, including writing, processing and reading.
The video is 25 fps, so the interval between two frames is 40 ms. That makes the processing too slow to show a continuous result.
So here is my idea: I use multiple queues for this work.
On the CPU side,
while (video is not over)
{
    1. read frame0;
       process frame0 using queue0;
       wait 40 ms;
    2. read frame1;
       process frame1 using queue1;
       wait 40 ms;
    3. 4. 5. ...
       (after 5 frames, i.e. roughly the 200 ms processing time)
    6. download frame0's result;
    7. read frame5;
       process frame5 using queue0;
       wait 40 ms;
    ...
}
The code means that the reading and processing of each frame are issued to one queue, and successive frames rotate through different queues.
My experiment does run faster, but only about 2 times faster, not at the speed I imagined.
Can anyone tell me how to deal with this? Thanks!

Assuming you have one Device, here are some thoughts on this point:
The main reason to have multiple Command Queues (CQs) per single OpenCL Device is the ability to execute kernels and do IO operations simultaneously.
Usually one CQ is enough to load a single Device at ~100%. Still, your multi-CQ idea is good (in my opinion), as you're constantly feeding the GPU with work.
Look at the kernel execution time. Maybe it's large enough that your Device is already executing kernels back-to-back and can't go any faster.
I think you don't need to wait 40 ms. A good solution is to process frames in the order in which they are put into the queue, to eliminate the difference between bitstream and display order.
If you have too many CQs, your OpenCL driver thread will be busy maintaining them, and performance may decrease.
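To make the pipelining concrete, here is a rough C sketch of the same scheme driven by the queues themselves instead of fixed 40 ms sleeps: each frame is bound to one of a few queues, and the blocking read that fetches a finished result is what paces the loop. The kernel, frame size, and the read_frame/show_frame helpers are placeholders for whatever your application uses, and error checking is omitted.

/* Sketch only: assumes the cl_context, cl_device_id and an image-processing
 * cl_kernel already exist; FRAME_BYTES, read_frame() and show_frame() are
 * placeholders. Error checking and cleanup omitted for brevity. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdlib.h>

#define NQUEUES     5                       /* ~200 ms of work / 40 ms per frame */
#define FRAME_BYTES (1920 * 1080 * 3)       /* hypothetical frame size */

void process_video(cl_context ctx, cl_device_id dev, cl_kernel kernel, int num_frames)
{
    cl_int err;
    cl_command_queue q[NQUEUES];
    cl_mem in_buf[NQUEUES], out_buf[NQUEUES];
    unsigned char *host_in[NQUEUES], *host_out[NQUEUES];
    size_t gsize = 1920 * 1080;

    for (int i = 0; i < NQUEUES; ++i) {
        q[i]        = clCreateCommandQueue(ctx, dev, 0, &err);
        in_buf[i]   = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  FRAME_BYTES, NULL, &err);
        out_buf[i]  = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, FRAME_BYTES, NULL, &err);
        host_in[i]  = malloc(FRAME_BYTES);
        host_out[i] = malloc(FRAME_BYTES);
    }

    for (int f = 0; f < num_frames; ++f) {
        int s = f % NQUEUES;                /* queue/slot used by this frame */

        /* This slot was last used NQUEUES frames ago, so its result should be
         * ready (or nearly so); the blocking read paces the loop instead of a
         * fixed 40 ms sleep. */
        if (f >= NQUEUES) {
            clEnqueueReadBuffer(q[s], out_buf[s], CL_TRUE, 0, FRAME_BYTES,
                                host_out[s], 0, NULL, NULL);
            /* show_frame(host_out[s]); */
        }

        /* read_frame(host_in[s]);  -- grab the next frame from the video */

        /* Non-blocking upload and kernel launch; the host returns immediately
         * and the driver overlaps IO and compute across the queues. */
        clEnqueueWriteBuffer(q[s], in_buf[s], CL_FALSE, 0, FRAME_BYTES,
                             host_in[s], 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf[s]);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf[s]);
        clEnqueueNDRangeKernel(q[s], kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
        clFlush(q[s]);                      /* make sure the work starts now */
    }

    for (int i = 0; i < NQUEUES; ++i)       /* drain the frames still in flight */
        clFinish(q[i]);
}

If the kernel alone takes close to (or more than) 40 ms per frame, no amount of queueing will reach real-time speed; in that case the kernel, not the host-side scheduling, is the limit, which matches the "look at kernel execution time" point above.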

Related

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I could perhaps extrapolate from CPU cycles, but PhantomJS is headless, and rendering in a real browser is certainly different. Perhaps it's impossible to do real measurements, but with an index it may at least be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall outlet and the computer. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, with the browser completely static (i.e. not doing anything). You need to make sure that everything is at steady state (SS), meaning start your measurements only after several minutes of idle.
Measure the usage while doing the operation you want. Again, you want to avoid any start-up and shutdown work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue, since the browser will execute any termination code after you finish your measurement.
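To put (purely hypothetical) numbers on the subtraction: if the machine idles at 40 W and draws 42 W while the browser is scrolling continuously, the operation itself accounts for roughly 2 W, so a 60-second run attributes about 2 W × 60 s = 120 J to the scrolling. The figure you report is always the difference between the two readings.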
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal-to-noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is buried well within the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser's startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal-to-noise ratio).
Also, make sure you understand the underlying electronics. For example, real power is V*A (voltage times current) only when the two are in phase. I don't think this will be an issue, since I'm pretty sure they are close to in phase for computers; in any case, a decent power meter accounts for the difference.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.

ALSA passthrough latency

I am working on an embedded Linux application with audio passthrough using ALSA. It has very stringent latency requirements.
The output buffer is as small as possible which results in an occasional (perhaps once an hour) underrun on the output. This is acceptable. However, when it occurs, it causes a "backup" in the capture buffer and the result is a creeping increase in latency.
There doesn't seem to be a reliable way to know how much output data was lost in order to discard the same amount of input. I can experiment, but even though it's an embedded application it needs to be device independent, so we need a reliable solution.
Does anyone know a way to determine how much data was lost, or if it is always one buffer, or have other suggestions?
If you do not want the PCM devices to stop on an underrun/overrun, configure them not to stop by setting the stop threshold to the boundary value. Then they will just continue to run, and the number of available frames will continue to increase (for capture) or decrease (for playback). (Not all of those frames will be usable; the ring buffer just wraps around.)
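Here is a minimal C sketch of that configuration, assuming the PCM handle has already been opened and its hardware parameters set; since the boundary value is queried from the device at run time, this stays device independent.

/* Sketch: keep a PCM running through xruns by moving the stop threshold
 * to the boundary value. Assumes `pcm` was opened with snd_pcm_open()
 * and its hardware parameters are already configured. */
#include <alsa/asoundlib.h>

static int disable_xrun_stop(snd_pcm_t *pcm)
{
    snd_pcm_sw_params_t *sw;
    snd_pcm_uframes_t boundary;
    int err;

    snd_pcm_sw_params_alloca(&sw);

    if ((err = snd_pcm_sw_params_current(pcm, sw)) < 0)
        return err;
    if ((err = snd_pcm_sw_params_get_boundary(sw, &boundary)) < 0)
        return err;
    /* a stop threshold at the boundary means "never stop on xrun" */
    if ((err = snd_pcm_sw_params_set_stop_threshold(pcm, sw, boundary)) < 0)
        return err;

    return snd_pcm_sw_params(pcm, sw);
}

Apply the same call to both the capture and the playback handle.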

x264 threading latency

I wonder why sliceless threading (http://akuvian.org/src/x264/sliceless_threads.txt) in x264 leads to latency. If I have, for example, 2 threads, the first encodes one frame and the second encodes another frame. The second has to wait for the first in some cases, but they can still largely be encoded in parallel.
So two threads should be faster than only one, right?
Frame threading adds latency measured in frames, not in seconds, because you need to feed the encoder more input frames before you start getting output frames (to fill the pipeline). Encoding one frame itself takes about the same processor time as with one thread, but threading allows pipelined processing by encoding different frames in parallel. Sliced threading, on the other hand, decreases latency because all threads encode one frame in parallel, so it finishes faster than it would with one thread (and sliced threading doesn't need any latency in frames for pipelining).
It took me quite a while to reason through it, but the answer is Queuing Theory.
Each frame can be started when half of the previous frame has been encoded. But if parallelization is going to provide any benefit most (preferably all) threads should have a frame to work on. 5 threads means 5 frames. That is the pipeline. Any time the pipeline is not completely full, parallelization is giving you less of a benefit. If the pipeline contains only one frame, only one thread is working and therefore you get no benefit from parallelization. But if your pipeline is usually full, what is it full of? Unencoded frames. Unencoded frames are frames that must have been captured and therefore they represent that many frames worth of latency. The latency might be slightly less by a small constant portion of a frame because some of those frames in the pipeline are partially encoded but in general each item in the pipeline contributes to the latency.
One reason for the added latency with more threads is that consecutive frames use each other for motion prediction and compensation. That means that in order to compress a frame you need information from the motion estimation of previous frames, so the frames depend on each other and sometimes have to wait for at least some data from the other threads. This is in contrast with sliced threading, where the threads slice up a single frame, each works on one slice of that same frame, and they already have all the needed information from previous frames (or the next ones, in the case of B-frames).
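For reference, here is a hedged C sketch of how the two modes are selected through the public x264 API (the preset and resolution values are just examples). With frame threading, the "5 threads means 5 frames" pipeline described above translates at 25 fps into roughly 5/25 s = 200 ms of added latency, which is what the zerolatency/sliced-threads path avoids.

/* Sketch: choosing between frame threading (higher throughput, latency in
 * frames) and sliced threading (lower latency, somewhat less efficient).
 * Field names as in x264.h; error checking omitted. */
#include <stdint.h>
#include <x264.h>

x264_t *open_encoder(int width, int height, int fps, int low_latency)
{
    x264_param_t p;

    if (low_latency)
        /* "zerolatency" enables sliced threads and disables lookahead,
         * so no extra frames are buffered inside the encoder */
        x264_param_default_preset(&p, "veryfast", "zerolatency");
    else
        /* default behaviour: frame threading; with N threads the encoder
         * keeps several frames in flight, which is where the latency comes from */
        x264_param_default_preset(&p, "veryfast", NULL);

    p.i_width   = width;
    p.i_height  = height;
    p.i_fps_num = fps;
    p.i_fps_den = 1;
    p.i_threads = X264_THREADS_AUTO;   /* let x264 pick the thread count */

    return x264_encoder_open(&p);
}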

LabVIEW Real-Time Timed Loop resolution

We are using LabVIEW Real-Time with the PXI-8110 Controller.
I am facing the following problem:
I have a loop with a 500 µs period (a timed loop) and no other task. I write the time at each loop iteration into RAM and then save the data afterwards.
It is necessary that the period be exact, but I see that it is 500 µs with +/- 25 µs of jitter.
The clock for the timed loop is 1 MHz.
How is it possible to get 500 µs - 25 µs? I would understand getting 500 µs + xx µs if my computation were too heavy, but so far I only do an addition, nothing more.
So does anyone have a clue what is going wrong?
I thought it would be possible to get a resolution of 1 µs, as NI advertises (if the computation isn't too heavy).
Thanks.
You may need to check which thread the code is running in. An easier way to work is to use the Timed Loop, as this will try to correct for overruns. Also, pre-allocate the array that you are storing the data into, and then use Replace Array Subset with each new value. You should see a massive improvement this way.
If you display that value while running in development mode, you will see jitter of +/- that amount, because you are reporting everything back to the host. Build the executable and the jitter will shrink again.

Is Scala doing anything in parallel on its own?

I have a little program that creates a maze. It uses lots of collections (the default variants, which are immutable, or at least used as immutable).
The program calculates 30 mazes of increasing dimensions, using a for comprehension over (1 to 30).
Since the parallel collections framework became available in the latest versions, I thought I'd give it a spin, hoping for some performance gain.
This failed, and when I investigated a little, I found the following:
When run without any call to anything remotely parallel it still showed a processor load of about 30% on each of the 4 cores of my machine.
When I replaced the Range 1 to 30 with (1 to 30).par CPU load went up to about 80% on all cores (which I expected). The order in which the mazes completed became more or less random (which I expected). The total time for all mazes stayed the same.
Replacing some of the internally used collections with their parallel counterparts did seem to have an effect.
I now have 2 questions:
Why do I have all 4 cores spinning, although there isn't anything that runs in parallel?
What might be likely reasons for the program to still take the same time, whether running in parallel or not? There are no obvious bottlenecks other than CPU cycles (no IO, no network, plenty of memory via the -Xmx setting).
Any ideas on this?
The 30% per core version is just a poor scheduler (sounds like Windows 7) migrating the process from core to core very frequently. It's probably closer to 25% per core (1/4) for your process plus misc other load making 30%. If you run the same example under Linux you would probably see one core pegged.
When you converted to (1 to 30).par, you started really using threads across all cores but the synchronization overhead of distributing such a small amount of work and then collecting the results cancelled out the parallelism gains. You need to break your work into larger independent chunks.
EDIT: If each of 1..30 represents some larger amount of work (solving a maze, say), then automatic parallelization will work much better if each unit of work is about the same size. Imagine you had 29 easy mazes and one very, very hard maze: the 30th maze will still run serially (or very nearly so), with everything else running in parallel around it. If your mazes increase in complexity by number, try spawning them in the order 30 to 1 by -1 so that the biggest tasks go first. Think of it as a braindead solution to the knapsack problem.