x264 threading latency

I wonder why sliceless threading (http://akuvian.org/src/x264/sliceless_threads.txt) in x264 leads to latency. If I have, for example, 2 threads, the first encodes one frame and the second encodes another. The second has to wait for the first in some cases, but the two frames can still be encoded in parallel.
So two threads should be faster than only one, right?

Frame threading adds latency in frames, not in seconds, because you need to feed the encoder more input frames before you start getting output frames (to fill the pipeline). Encoding one frame takes about the same processor time as with one thread, but threading lets the encoder pipeline the work by encoding different frames in parallel. Sliced threading, on the other hand, decreases latency because all threads encode the same frame in parallel, so the frame is finished sooner than it would be with one thread (sliced threading also needs no latency in frames for pipelining).

It took me quite a while to reason through it, but the answer is Queuing Theory.
Each frame can be started when half of the previous frame has been encoded. But if parallelization is going to provide any benefit most (preferably all) threads should have a frame to work on. 5 threads means 5 frames. That is the pipeline. Any time the pipeline is not completely full, parallelization is giving you less of a benefit. If the pipeline contains only one frame, only one thread is working and therefore you get no benefit from parallelization. But if your pipeline is usually full, what is it full of? Unencoded frames. Unencoded frames are frames that must have been captured and therefore they represent that many frames worth of latency. The latency might be slightly less by a small constant portion of a frame because some of those frames in the pipeline are partially encoded but in general each item in the pipeline contributes to the latency.
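To put rough numbers on the argument above, here is a minimal sketch. It assumes the pipeline holds one captured frame per thread and that each of them costs one frame interval of latency; the function name and the 25 fps figure are made up for illustration, not taken from x264.

```python
def pipeline_latency_ms(threads: int, fps: float) -> float:
    """Approximate latency added by keeping one frame per thread in the pipeline."""
    frame_interval_ms = 1000.0 / fps
    # Every frame sitting in the pipeline had to be captured first, so each
    # one contributes roughly one frame interval of latency (assumption).
    return threads * frame_interval_ms


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, "thread(s):", pipeline_latency_ms(n, 25.0), "ms")
```

As noted above, the real figure is slightly lower because some of the frames in the pipeline are already partially encoded.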

One reason for added latency with more threads is that consecutive frames use each other for motion prediction and compensation. To compress a frame you need motion estimation data from previous frames, so the frames depend on each other and a thread sometimes has to wait for at least some data from other threads. This is in contrast with slice threading, where the threads split up a single frame, each works on one slice of that same frame, and all of them already have the info they need from previous frames (or from following frames in the case of B-frames).

Related

Optimal size for a ring buffer with single producer and single consumer

I have a single producer, single consumer problem which (I believe) can be solved using a circular/ring buffer.
I have a micro-controller running an RTOS, with an ISR (Interrupt Service Routine) handling UART (serial port) interrupts. When the UART raises an interrupt, the ISR posts the received characters into the circular buffer. In another RTOS task (called packet_handler), I am reading from this circular buffer and running a state machine to decode the packet. A valid packet has 64 bytes including all the framing bytes.
The UART operates at 115200 baud, and a packet arrives every 10 ms. The packet_handler runs every 10 ms. However, this handler may sometimes get delayed by the scheduler due to another higher priority task executing.
If I use an arbitrarily large circular buffer, there are no packet drops. But how do I determine the optimal circular buffer size in this case, at least theoretically? How is this decision of buffer size made in practice?
Currently, I am detecting overrun of the buffer through some instrumentation functions and then increasing the buffer size to reduce packet loss.
You won't be safe, ever, as you are dealing with a stochastic process (according to your explanation).
Answering your question: you would need an infinite buffer, just in case the consumer task stays in the ready state (without running) indefinitely. So you will have to change something in your initial approach:
Increase the priority of the consumer, in order to ensure the 10 ms execution (the smallest-buffer approach, but it may not be possible).
Try to get a better characterization of your model, in order to predict the maximum gap of time in which the consumer task won't be executed (make your system as predictable as possible).
Accept losing packets with an arbitrarily chosen buffer size (it may not be safe).
I would calculate it this way:
 64 bytes received just now
 64 bytes still in the ring buffer
+100% to be safe
===================
256-byte buffer
But this is just a guess. You have to do some worst-case tests with your buffer and then add +100% to be safe.
While all of the above answers are correct and throw light on the issue, this page summarizes all the factors to be considered while choosing the size of a ring buffer.
Some queuing models can be used to theoretically analyze the problem at hand and find out the suitable size of ring buffer.
A more pragmatic approach is to start with a large buffer, then find out the maximum used buffer size in a real test case (this process is called watermarking) and use this figure in the final code.
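A minimal sketch of the watermarking idea, written in Python for readability rather than as firmware code. The class and field names are hypothetical; on the target this would be a C ring buffer with the same two extra counters.

```python
class WatermarkedRing:
    """Byte ring buffer that records its peak fill level (high-water mark)."""

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0        # next write index (producer / ISR side)
        self.tail = 0        # next read index (consumer / task side)
        self.count = 0       # bytes currently stored
        self.high_water = 0  # largest value `count` has ever reached
        self.overruns = 0    # bytes dropped because the buffer was full

    def put(self, byte: int) -> bool:
        if self.count == self.capacity:
            self.overruns += 1  # buffer full: the byte is dropped
            return False
        self.buf[self.head] = byte
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        self.high_water = max(self.high_water, self.count)
        return True

    def get(self):
        if self.count == 0:
            return None  # nothing to read
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        self.count -= 1
        return byte
```

Run the system under its worst realistic load with a generously sized buffer, read `high_water` afterwards, and size the final buffer from that figure plus whatever margin you are comfortable with. A non-zero `overruns` means the test buffer itself was too small to measure the peak.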
It is simply a matter of determining the maximum possible delay - the sum of the execution times of all higher priority tasks that can run - and dividing that by the packet arrival interval.
This may not be straightforward, but can be simplified by making only the most deterministic tasks higher priority and moving less deterministic and longer running tasks to lower priorities according to rate-monotonic principles. Tasks that frequently run for a short time, but sporadically run for longer, are candidates for being split into two tasks (and further queues) to offload the longer execution to a lower priority.
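The rule of dividing the worst-case delay by the packet arrival interval can be sketched as below. The 64-byte packet and 10 ms interval come from the question; the 25 ms worst-case delay is a made-up figure, and adding one extra packet for the one the handler would have read on time anyway is a conservative choice of this sketch, not something the answer specifies.

```python
import math


def ring_buffer_bytes(max_delay_ms: float,
                      packet_interval_ms: float = 10.0,
                      packet_bytes: int = 64) -> int:
    """Bytes needed to absorb a worst-case consumer delay without overrun."""
    # Packets that can arrive while the consumer is held off, rounded up,
    # plus the one packet it would have handled on schedule anyway.
    packets = math.ceil(max_delay_ms / packet_interval_ms) + 1
    return packets * packet_bytes


if __name__ == "__main__":
    # Hypothetical: higher priority tasks can hold the handler off for 25 ms.
    print(ring_buffer_bytes(25.0), "bytes")
```

The hard part is still obtaining a trustworthy `max_delay_ms`, which is why measuring the high-water mark under load is a useful cross-check.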

Why do sliced threads affect realtime encoding so much when using ffmpeg x264?

I'm using ffmpeg libx264 to encode a 720p screen captured from x11 in realtime at 30 fps.
When I use the -tune zerolatency parameter, the average encode time per frame can be as high as 12 ms with the baseline profile.
After studying the ffmpeg x264 source code, I found that the key parameter leading to such a long encode time is sliced-threads, which is enabled by -tune zerolatency. After disabling it with -x264-params sliced-threads=0, the encode time can be as low as 2 ms.
And with sliced-threads disabled, the CPU usage is 40%, compared with only 20% when it is enabled.
Can someone explain the details of sliced threads, especially in realtime encoding (assume no frames are buffered for encoding; a frame is encoded only when it is captured)?
The documentation shows that frame-based threading has better throughput than slice-based. It also notes that the latter doesn't scale well due to parts of the encoder that are serial.
Speedup vs. encoding threads for the veryfast preset (non-realtime):
threads     speedup            psnr
          slice   frame    slice    frame
x264 --preset veryfast --tune psnr --crf 30
 1:       1.00x   1.00x   +0.000   +0.000
 2:       1.41x   2.29x   -0.005   -0.002
 3:       1.70x   3.65x   -0.035   +0.000
 4:       1.96x   3.97x   -0.029   -0.001
 5:       2.10x   3.98x   -0.047   -0.002
 6:       2.29x   3.97x   -0.060   +0.001
 7:       2.36x   3.98x   -0.057   -0.001
 8:       2.43x   3.98x   -0.067   -0.001
 9:               3.96x            +0.000
10:               3.99x            +0.000
11:               4.00x            +0.001
12:               4.00x            +0.001
The main difference seems to be that frame threading adds frame latency, as it needs different frames to work on, while with slice-based threading all threads work on the same frame. In realtime encoding, as opposed to offline encoding, the encoder would have to wait for more frames to arrive in order to fill the pipeline.
Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.
From: Diary of an x264 Developer
Sliceless threading: example with 2 threads.
Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.
From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt
Therefore it makes sense to enable sliced-threads with -tune zerolatency, as you need to send each frame as soon as possible rather than encode frames efficiently (performance- and quality-wise).
Conversely, using too many threads can hurt performance, as the overhead of maintaining them can exceed the potential gains.

ALSA passthrough latency

I am working on an embedded Linux application with audio passthrough using ALSA. It has very stringent latency requirements.
The output buffer is as small as possible which results in an occasional (perhaps once an hour) underrun on the output. This is acceptable. However, when it occurs, it causes a "backup" in the capture buffer and the result is a creeping increase in latency.
There doesn't seem to be a reliable way to know how much output data was lost in order to discard the same amount of input. I can experiment, but even though it's an embedded application it needs to be device independent, so we need a reliable solution.
Does anyone know a way to determine how much data was lost, or if it is always one buffer, or have other suggestions?
If you do not want the PCM devices to stop on an underrun/overrun, configure them not to stop by setting the stop threshold to the boundary value. Then they will just continue to run, and the number of available frames will continue to increase (for capture) or decrease (for playback). (Not all of those frames will be usable; the ring buffer just wraps around.)

How to run a multi-queue code using OpenCL?

For example, I'm doing some image processing work on every frame of a video.
Processing each frame takes 200 ms, including writing, processing and reading.
The video runs at 25 fps, so consecutive frames are 40 ms apart. The processing is therefore too slow to show a continuous result.
So here is my idea: I use multiple queues for this work.
On the CPU side:
while (video is not over)
{
    1. read frame0;
       process frame0 using queue0;
       wait 40 ms;
    2. read frame1;
       process frame1 using queue1;
       wait 40 ms;
    3. 4. 5. ...
    (after 5 frames, i.e. about the 200 ms processing time)
    6. download frame0's result;
    7. read frame5;
       process frame5 using queue0;
       wait 40 ms;
    ...
}
The code means that I use different queues for reading and processing the same frame in a video.
However, in my experiment the result is faster, but only about 2 times faster, not the speedup I had imagined.
Can anyone tell me how to deal with it? Thanks!
Assuming you have one Device, here are some thoughts on this point:
The main reason to have multiple Command Queues (CQ) per single OpenCL Device is the ability to execute kernels and do IO operations simultaneously.
Usually one CQ is enough to load a single Device at ~100%. Still, your multi-CQ idea is good (in my opinion), as you're constantly feeding the GPU with workload.
Look at the kernel execution time. Maybe it's long enough that your Device is constantly executing kernels and can't go any faster.
I think you don't need to wait for 40 ms. A good solution is to process frames in the order in which they were put in the queue, which eliminates the difference between bitstream and display order.
If you have too many CQs, your OpenCL driver thread will be busy maintaining them, so performance may decrease.
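A back-of-the-envelope check of the numbers in the question, as a sketch only. It assumes frames in flight overlap perfectly, which is exactly what a saturated Device does not allow; the function names are made up.

```python
import math


def queues_needed(processing_ms: float, frame_interval_ms: float) -> int:
    """Frames that must be in flight to keep up, if they overlap perfectly."""
    return math.ceil(processing_ms / frame_interval_ms)


def achieved_fps(processing_ms: float, speedup: float) -> float:
    """Throughput actually reached for a measured overall speedup."""
    return speedup * 1000.0 / processing_ms


if __name__ == "__main__":
    print(queues_needed(200.0, 40.0), "frames in flight needed")
    print(achieved_fps(200.0, 2.0), "fps with the measured 2x speedup")
```

With 200 ms per frame and 40 ms between frames, 5 frames would have to overlap fully to reach 25 fps. A measured 2x speedup gives about 10 fps, which is consistent with the kernels themselves already keeping the Device busy and only the transfers overlapping.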

AudioToolbox - Callback delay while recording

I've been working on a very specific project for iOS lately, and my research has led me to almost final code. I've solved all the extreme difficulties I've found until now, but on this one I don't seem to have a clue (about either the cause or the possibility of solving it).
I set up my audioqueue (sample rate 44100, format LinearPCM, 16 bits per channel, 2 bytes per frame, 1 channel per frame...) and start recording the sound with 12 audio buffers. However, there seems to be a delay after every 4 callbacks.
The situation is the following: the first 4 callbacks are called at intervals of about 2 ms each. However, between the 4th and the 5th, there is a delay of about 60 ms. The same thing happens between the 8th and the 9th, the 12th and the 13th, and so on.
There seems to be a relation between the bytes per frame and the moment of the delay. I know this because if I change to 4 bytes per frame, I start having the delay between the 8th and the 9th, then between the 16th and the 17th, the 24th and the 25th... Nonetheless, there doesn't seem to be any relation between the moment of the delay and the number of buffers.
The callback function does only two things: it stores the audio data (inBuffer->mAudioData) in an array my class can use, and it calls AudioQueueEnqueueBuffer again to put the current buffer back on the queue.
Did anyone go through this problem already? Does anyone know, at least, what could be the cause of it?
Thank you in advance.
The Audio Queue API seems to run on top of the RemoteIO Audio Unit API, whose real audio buffer size is probably unrelated to, and in your example larger than, whatever size your Audio Queue buffers are. So whenever a RemoteIO buffer is ready, a bunch of your smaller AQ buffers quickly get filled from it. And then you get a longer delay waiting for the next larger buffer to be filled with samples.
If you want better controlled (more evenly spaced) buffer latency, try using the RemoteIO Audio Unit directly.
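The bursty callback pattern described above can be reproduced with a toy model. This is only a sketch of the suspected mechanism, not of how Audio Queue is actually implemented: the buffer sizes (4096 and 1024 frames) and the 2 ms spacing inside a burst are made-up values, and the function name is hypothetical.

```python
def callback_times(hw_frames: int, aq_frames: int, sample_rate: int,
                   n_callbacks: int, burst_gap_ms: float = 2.0) -> list:
    """Times (ms) at which small queue buffers are handed to the callback."""
    hw_period_ms = 1000.0 * hw_frames / sample_rate
    times = []
    pending = 0  # frames received from the hardware but not yet delivered
    t = 0.0
    while len(times) < n_callbacks:
        t += hw_period_ms      # wait for the next large hardware buffer
        pending += hw_frames
        k = 0
        # Drain the large buffer into small queue buffers back to back.
        while pending >= aq_frames and len(times) < n_callbacks:
            times.append(t + k * burst_gap_ms)
            pending -= aq_frames
            k += 1
    return times


if __name__ == "__main__":
    ts = callback_times(4096, 1024, 44100, 8)
    gaps = [round(b - a, 1) for a, b in zip(ts, ts[1:])]
    print(gaps)  # short gaps inside a burst, one long gap between bursts
```

In this model the number of callbacks per burst is simply the ratio of the two buffer sizes, so changing the size of the queue buffers relative to the underlying buffer changes where the long gap falls, while the number of queue buffers has no effect.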