How to minimize latency when reading audio with ALSA? - latency

When trying to acquire some signals in the frequency domain, I've encountered the issue of having snd_pcm_readi() take a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
I have that most of the time, snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds. However, every 4-5 call to snd_pcm_readi() requires approximately 0.028 seconds. This is a huge difference, and causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried to experiment with the period size, but it is unclear to me what exactly it does even after re-reading the documentation multiple times. I don't use an interrupt driven design, I simply call snd_pcm_readi() and it blocks until it returns -- with data.
I can only assume that the reason it blocks for a variable amount of time, is that snd_pcm_readi() pulls data from the hardware buffer, which happens to already have data readily available for transfer to the "application buffer" (which I'm maintaining). However, sometimes, there is additional work to do in kernel space or on the hardware side, hence the function call takes longer to return in these cases.
What purpose does the "period size" serve when I'm not using an interrupt driven design? Can my problem be fixed at all by manipulation of the period size, or should I do something else?
I want to achieve that each call to snd_pcm_readi() takes approximately the same amount of time. I'm not asking for a real time compliant API, which I don't imagine ALSA even attempts to be, however, seeing a difference in function call time on the order of being 500 times longer (which is what I'm seeing!) then this is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.

Typically when reading and writing audio, the period size specifies how much data ALSA has reserved in DMA silicon. Normally the period size specifies your latency. So for example while you are filling a buffer for writing through DMA to the I2S silicon, one DMA buffer is already being written out.
If you have your period size too small, then the CPU doesn't have time to write audio out in the scheduled execution slot provided. Typically people aim for a minimum of 500 us or 1 ms in latency. If you are doing heavy forms of computation, then you may want to choose 5 ms or 10 ms of latency. You may choose even more latency if you are on a non-powerful embedded system.
If you want to push the limit of the system, then you can request the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One method for increasing priority taken from the gtkIOStream ALSA C++ OO classes is like so (taken from the changeThreadPriority method) :
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority){
int ret;
pthread_t thisThread = pthread_self(); // get the current thread
struct sched_param origParams, params;
int origPolicy, policy = SCHED_FIFO, newPolicy=0;
if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams))!=0)
return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);
if (priority<0) //maximum priority
params.sched_priority = sched_get_priority_max(policy);
else
params.sched_priority = priority;
if (params.sched_priority>sched_get_priority_max(policy))
return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
if (params.sched_priority<sched_get_priority_min(policy))
return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");
if ((ret = pthread_setschedparam(thisThread, policy, &params))!=0)
return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");
if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params))!=0)
return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
if(policy != newPolicy)
return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");
printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
return 0;
}

Related

What is the meaning of CANBUS function mode initilazing settings for STM32?

I want to understand meaning of the following function mode definition, there is explanation in the library. But I don't understand that because explanations are very short and not enough. I searched on the net I couldnt find any information about.
CAN_InitStructure.CAN_TTCM = DISABLE;
CAN_InitStructure.CAN_ABOM = DISABLE;
CAN_InitStructure.CAN_AWUM = DISABLE;
CAN_InitStructure.CAN_NART = ENABLE;
CAN_InitStructure.CAN_RFLM = DISABLE;
CAN_InitStructure.CAN_TXFP = ENABLE;
These are the names of the bits located in the CAN master control register (CAN_MCR). So, the proper source for their meaning is the reference manual. My following answer will be somewhat copy & paste from the reference manual, but I will try to explain these bits in detail.
TTCM (Time triggered communication mode): This bit activates the Time Triggered Communication (TTCAN) mode, which is an extension to the CAN standard. I don't know much about TTCAN, but as I understand, it assigns time windows to messages to satisfy some real-time requirements. So, normally this bit should remain 0.
ABOM (Automatic bus-off management): If the transmit error counter (TEC) becomes greater than 255, the CAN hardware switches to bus-off state. To recover, it must wait for the recovery sequence, 128 occurrences of 11 consecutive recessive bits. Only after that, the CAN hardware may return to the normal operating state. This bit controls the returning behavior. If it's 1, returning to normal state is automatic. Otherwise, software should make the request, provided that the recovery sequence has been observed.
AWUM (Automatic wakeup mode): The CAN module can be in one of 3 modes: Initialization mode, normal mode or sleep (low power) mode. Sleep mode is requested by the software. However, you have 2 options to exit sleep mode. If this bit is 0, then you have to exit sleep mode manually. You may enable CAN wakeup interrupt to inform you about bus activity, then exit the sleep mode in ISR. But if this bit is 1, the hardware returns to normal mode automatically when it detects bus activity.
NART (No automatic retransmission): Normally, CAN hardware retries to transmit a message if its previous attempts fail, because of arbitration lost etc. But if you make this bit 1, the transmitter does not retry. This is required when you use Time Triggered Communication (TTCAN). Otherwise, you should keep this bit 0.
RFLM (Receive FIFO locked mode): Your receive mailboxes have 3 levels depth, meaning that they can store maximum 3 messages before they are overrun. This bit controls what happens in case of mailbox overrun. Default behavior is to keep the oldest 2 messages and the newest one. For example, if you received 5 messages, the buffer keeps the messages 1, 2 & 5. However, if you make this bit 1, the mailbox keeps the messages 1, 2 & 3 and discards the new arrivals.
TXFP (Transmit FIFO priority): You have 3 transmit mailboxes. When you fill more than one, the hardware must decide which one to transmit first. Normally, one can assume that a message with a lower ID number is more important and should be transmitted first. But if you want to transfer them in a first-comes-first-served fashion for some reason, you need to make this bit 1. Of course, this is just a local priority. On the physical bus, the messages with lower ID always have priority.

How to modify bit bash sequence for write delays and read delays of DUT?

I have a DUT were the writes takes 2 clock cycles and reads consume 2 clock cycles before it could actually happen, I use regmodel and tried using inbuilt sequence uvm_reg_bit_bash_seq but it seems that the writes and reads happens at 1 clock cycle delay, could anyone tell what is the effective way to model 2 clock cycle delays and verify this, so that the DUT delays are taken care.
Facing the following error now,
Writing a 1 in bit #0 of register "ral_pgm.DIFF_STAT_CORE1" with initial value 'h0000000000000000 yielded 'h0000000000000000 instead of 'h0000000000000001
I have found one way of doing it, took the existing uvm_reg_single_bit_bash_seq and modified by adding p_sequencer and added 2 clock cycle delays after write and read method calls as per the DUT latency, this helped me in fixing the issue as well added a get call after write method to avoid fetching old value after read operation.
...
`uvm_declare_p_sequencer(s_master_sequencer)
rg.write(status, val, UVM_FRONTDOOR, map, this);
wait_for_clock(2); // write latency
rg.read(status, val, UVM_FRONTDOOR, map, this);
wait_for_clock(2); // read latency
if (status != UVM_IS_OK) begin
`uvm_error("mtm_reg_bit_bash_seq", $sformatf("Status was %s when reading register \"%s\" through map \"%s\".",
status.name(), rg.get_full_name(), map.get_full_name()));
end
val = rg.get(); // newly added method call (get) to fetch value after read
...
task wait_for_clock( int m = 1 );
repeat ( m ) begin
#(posedge p_sequencer.vif.CLK);
end
endtask: wait_for_clock
...

Triggering Interrupt for any byte received

I'm trying to get a code to work that triggers an interrupt for a variable data size coming to a RX input of a STM32 board (not discovery) in DMA Circular mode. ex.:CONNECTED\r\nDATAREQUEST\r\n
So far so good, I'm being able to receive data and all, while also triggering the DMA interrupt.
I will then create a sub RX message processing buffer breaking down each \r\n to a different char array pointer.
msgProcessingBuffer[0] = "COM_OK"
msgProcessingBuffer[1] = "DATAREQUEST"
msgProcessingBuffer[n] = "BlahBlahBlah"
My problem comes actually from the trigger of the interrupt. I would like to trigger the interrupt from any amount of data and processing any data received.
If I use the interrupt request bellow:
HAL_UART_Receive_DMA(&huart1,uart1RxMsgBuffer, 30);
The input buffer will take 30 bytes to trigger the interrupt, but that's too much time to wait because I would like to process the RX data as soon as a \r\n is found in the string. So I cannot wait for the full buffer to fill to begin processing it.
If I use the interrupt request bellow:
HAL_UART_Receive_DMA(&huart1,uart1RxMsgBuffer, 1;
It will trigger as I want, but there is no point on using DMA in this case because it will trigger the interrupt for every byte and will create a buffer of just 1 byte (duh) just like in "polling mode".
So my question is, how do I trigger the DMA for the first byte received but still receive/process all data that might come after it in a single interrupt? I believe I might be missing some basic concept here.
Best regards,
Blukrr
In short: HAL/SPL libraries don't provide such feachures.
Generally some MCUs, for example STM32F091VCT6 have hardware supporting of Modbus and byte flow analysis (interrupt by recieve some control byte) - so if you will use such MCU in you project, you can configure receive by circular DMA with interrupts by receive '\r' or '\n' byte.
And I repeat: HAL or SPL don't support this features, you can use it only throught work with registers (see reference manuals).
I was taking a look at some other forums and I've found there a work around for this problem.
I'm using a DMA in circular mode and then I monitor the NDTR which updates its value every time a byte is received through the UART interface. Then I cyclically call a function (in while 1 loop or in a cyclic interrupt handler) that break down each message part always looking for /n /r chars. This function also saves the current NDTR value for comparison if it has changed since the last "while 1" cycle. If the NDTR has changed since last cycle I wait a couple milliseconds to receive the remaining message (UART it's too slow to transmit) and then save those received messages in a char buffer array for post processing.
If you create a circular DMA buffer of about 50 bytes (HAL_UART_Receive_DMA(&huart1,uart1RxMsgBuffer, 50)) I think it's enough to compensate any fluctuations in the program cycle.
In the mean time I opened a ticket to ST and they confirmed what you just said they also added:
SOLUTION PROPOSED BY SUPPORTER - 14/4/2016 16:45:22 :
Hi Gilberto,
The DMA interrupt requests available are listed on Table 50 of the Reference Manual, RM0090, http://www.st.com/web/en/resource/technical/document/reference_manual/DM00031020.pdf. Therefore, basically, the DMA interrupt can only trigger at the end of one of these events.
• Half-transfer reached
• Transfer complete
• Transfer error
• Fifo error (overrun, underrun or FIFO level error)
• Direct mode error
Getting a DMA interrupt to trigger upon reception of a specific character in your receive data stream is not possible. You may want to trigger the interrupt when you receive packets of say 30 bytes each and then process the datastring to check if your \r\n chars have arrived so you can process the data block.
Regards,
MCU Tech Support

libspotify C sending zeros at the end of track

I'm using libspotify SDK, C library for win32.
I think to have a right setup, every session callback is registered. I don't understand why i can't receive the call for end_of_track, while music_delivery continues to be called with zero padding 22050 long frames.
I attempt to start playing first loading the track with sp_session_load; till it returns SP_ERROR_IS_LOADING I post a message on my message queue (synchronization method I've used, PostMessage win32 API) in order to reload again with same API sp_session_load. As soon as it returns SP_ERROR_OK I use the sp_session_play and the music_delivery starts immediately, with correct frames.
I don't know why at the end of track the libspotify runtime then start sending zero padded frames, instead of calling end_of_track callback.
In other conditions it works perfectly: I've used the sp_track obtained from a album browse, so the track is fully loaded at the moment I load to the current session for playing: with this track, it works fine with end_of_track called correctly. In the case with padding error, I search the track using its Spotify URI and got the results; in this case the track metadata are not still ready (at the play attempt) so I used that kind of "polling" on sp_session_load with PostMessage.
Can anybody help me?
I ran into the same problem and I think the issue was that I was consuming the data too fast without giving other threads time to do any work since I was spending all of my time in the music_delivery callback. I found that if I add some throttling and notify the main thread that it can wake up to do some processing, the extra zeros at the end of track is reduced to one delivery of 22,050 frames (or 500ms at 44.1kHz).
Here is an example of what I added to my callback, heavily borrowed from the jukebox.c example provided with the SDK:
/* Buffer 1 second of data, then notify the main thread to do some processing */
if (g_throttle > format->sample_rate) {
pthread_mutex_lock(&g_notify_mutex);
g_notify_do = 1;
pthread_cond_signal(&g_notify_cond);
pthread_mutex_unlock(&g_notify_mutex);
// Reset the throttle counter
g_throttle = 0;
return 0;
}
As I said, there was still 22,050 frames of zeros delivered before the track stopped, but I believe libspotify may purposely do this to ensure that the duration calculated by the number of frames received (song_duration_ms = total_frames_delivered / sample_rate * 1000) is greater than or equal to the duration reported by sp_track_duration. In my case, the track I was trying to stream was 172,000ms in duration, without the extra padding the duration calculated is 171,796ms, but with the padding it was 172,296ms.
Hope this helps.

High CPU and Memory Consumption on using boost::asio async_read_some

I have made a server that reads data from client and I am using boost::asio async_read_some for reading data, and I have made one handler function and here _ioService->poll() will run event processing loop to execute ready handlers. In handler _handleAsyncReceive I am deallocating the buf that is assigned in receiveDataAsync. bufferSize is 500.
Code is as follows:
bool
TCPSocket::receiveDataAsync( unsigned int bufferSize )
{
char *buf = new char[bufferSize + 1];
try
{
_tcpSocket->async_read_some( boost::asio::buffer( (void*)buf, bufferSize ),
boost::bind(&TCPSocket::_handleAsyncReceive,
this,
buf,
boost::asio::placeholders::error,
boost::asio::placeholders::bytes_transferred) );
_ioService->poll();
}
catch (std::exception& e)
{
LOG_ERROR("Error Receiving Data Asynchronously");
LOG_ERROR( e.what() );
delete [] buf;
return false;
}
//we dont delete buf here as it will be deleted by callback _handleAsyncReceive
return true;
}
void
TCPSocket::_handleAsyncReceive(char *buf, const boost::system::error_code& ec, size_t size)
{
if(ec)
{
LOG_ERROR ("Error occurred while sending data Asynchronously.");
LOG_ERROR ( ec.message() );
}
else if ( size > 0 )
{
buf[size] = '\0';
LOG_DEBUG("Deleting Buffer");
emit _asyncDataReceivedSignal( QString::fromLocal8Bit( buf ) );
}
delete [] buf;
}
Here the problem is buffer is allocated at much faster rate as compare to deallocation so memory usage will go high at exponential rate and at some point of time it will consume all the memory and system will be stuck. CPU usage will also be around 90%. How can I reduce the memory and CPU consumption?
You have a memory leak. io_service poll does not guarantee that it with dispatch your _handleAsyncReceive. It can dispatch other event (e.g an accept), so your memory at char *buf is lost. My guess you are calling receiveDataAsync from a loop, but its not necessary - leak will exist in any case (with different leak speed).
Its better if you follow asio examples and work with suggested patterns rather than make your own.
You might consider using a wrap around buffer, which is also called a circular buffer. Boost has a template circular buffer version available. You can read about it here. The idea behind it is that when it becomes full, it circles around to the beginning where it will store things. You can do the same thing with other structures or arrays as well. For example, I currently use a byte array for this purpose in my application.
The advantage of using a dedicated large circular buffer to hold your messages is that you don't have to worry about creating and deleting memory for each new message that comes in. This avoids fragmentation of memory, which could become a problem.
To determine an appropriate size of the circular buffer, you need to think about the maximum number of messages that can come in and are in some stage of being processed simultaneously; multiply that number by the average size of the messages and then multiply by a fudge factor of perhaps 1.5. The average message size for my application is under 100 bytes. My buffer size is 1 megabyte, which would allow for at least 10,000 messages to accumulate without it affecting the wrap around buffer. But, if more than 10,000 messages did accumulate without being completely processed, then the circular buffer would be unuseable and the program would have to be restarted. I have been thinking about reducing the size of the buffer because the system would probably be dead long before it hit the 10,000 message mark.
As PSIAlt suggest, consider following the Boost.Asio examples and build upon their patterns for asynchronous programming.
Nevertheless, I would suggest considering whether multiple read calls need to be queued onto the same socket. If the application only allows for a single read operation to be pending on the socket, then resources are reduced:
There is no longer the scenario where there are an excessive amount of handlers pending in the io_service.
A single buffer can be preallocated and reused for each read operation. For example, the following asynchronous call chain only requires a single buffer, and allows for the concurrent execution of starting an asynchronous read operation while the previous data is being emitted on the Qt signal, as QString performs deep-copies.
TCPSocket::start()
{
receiveDataAsync(...) --.
} |
.---------------'
| .-----------------------------------.
v v |
TCPSocket::receiveDataAsync(...) |
{ |
_tcpSocket->async_read_some(_buffer); --. |
} | |
.-------------------------------' |
v |
TCPSocket::_handleAsyncReceive(...) |
{ |
QString data = QString::fromLocal8Bit(_buffer); |
receiveDataAsync(...); --------------------------'
emit _asyncDataReceivedSignal(data);
}
...
tcp_socket.start();
io_service.run();
It is important to identify when and where the io_service's event loop will be serviced. Generally, applications are designed so that the io_service does not run out of work, and the processing thread is simply waiting for events to occur. Thus, it is fairly common to start setting up asynchronous chains, then process the io_service event loop at a much higher scope.
On the other hand, if it is determined that TCPSocket::receiveDataAsync() should process the event loop in a blocking manner, then consider using synchronous operations.