How to fix data inconsistency caused by the CPU store buffer? - x86-64

Say that on x86-64 there are two cores, each running a thread that does the following in a loop: compare-and-swap a shared value (if it is 0, change it to 1), do some other work, then set the value back to 0. This is much like a simple spinlock. My concern is this: suppose core-1 has set the value to 1 and core-2 is busy-waiting (testing the value). When core-1 then sets the value to 0, the CPU may do the following, in timeline order:
time 0: core-1 writes the new value to its store buffer and sends a "read invalidate" message to core-2
time 1: core-2 receives the message, saves it in its invalidate queue, and sends an ACK to core-1
time 2: core-1 receives the ACK and flushes its store buffer
time 1.5 or 2.5: core-2 flushes its invalidate queue
So if core-1 reads the value again at time 0.5, it gets the new data, while core-2 still sees the stale data. This is my guess; can it really happen like this? If yes, how can the problem be fixed? I don't think a memory barrier or a LOCK-prefixed (bus lock) instruction would help. Additionally, does a C++11 std::atomic value have the same problem?
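For reference, the loop described above can be sketched with C++11 std::atomic as follows. This only illustrates the setup in the question and says nothing about how the hardware orders the individual steps; the names SpinFlag and runTwoThreads are hypothetical. It uses the default sequentially consistent ordering, under which the C++11 memory model guarantees that a thread whose compare-and-swap observes the stored 0 also observes the writes made before that store.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spinlock built on one shared atomic value: 0 = free, 1 = taken.
struct SpinFlag {
    std::atomic<int> value{0};

    void lock() {
        int expected = 0;
        // Compare-and-swap: change 0 to 1, otherwise keep spinning.
        while (!value.compare_exchange_weak(expected, 1)) {
            expected = 0;  // a failed CAS overwrites 'expected' with the current value
        }
    }

    void unlock() {
        value.store(0);  // set the value back to 0 (sequentially consistent by default)
    }
};

// Two threads each increment a plain (non-atomic) counter under the lock.
// Returns the final counter value.
long runTwoThreads(long iterationsPerThread) {
    SpinFlag flag;
    long counter = 0;
    auto work = [&] {
        for (long i = 0; i < iterationsPerThread; ++i) {
            flag.lock();
            ++counter;  // stands in for the "other work" done while holding the value at 1
            flag.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

If the lock excludes the two threads as intended, runTwoThreads(20000) returns 40000; a lost update would show up as a smaller number.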

Related

How to minimize latency when reading audio with ALSA?

When trying to acquire some signals in the frequency domain, I've encountered the issue of having snd_pcm_readi() take a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
I find that most of the time, snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds. However, every 4th or 5th call to snd_pcm_readi() takes approximately 0.028 seconds. This is a huge difference, and causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried to experiment with the period size, but it is unclear to me what exactly it does even after re-reading the documentation multiple times. I don't use an interrupt driven design, I simply call snd_pcm_readi() and it blocks until it returns -- with data.
I can only assume that the reason it blocks for a variable amount of time, is that snd_pcm_readi() pulls data from the hardware buffer, which happens to already have data readily available for transfer to the "application buffer" (which I'm maintaining). However, sometimes, there is additional work to do in kernel space or on the hardware side, hence the function call takes longer to return in these cases.
What purpose does the "period size" serve when I'm not using an interrupt driven design? Can my problem be fixed at all by manipulation of the period size, or should I do something else?
I want each call to snd_pcm_readi() to take approximately the same amount of time. I'm not asking for a real-time compliant API, which I don't imagine ALSA even attempts to be; however, a function call that takes on the order of 500 times longer than usual (which is what I'm seeing!) is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.
Typically when reading and writing audio, the period size specifies how much data ALSA has reserved in DMA silicon. Normally the period size specifies your latency. So for example while you are filling a buffer for writing through DMA to the I2S silicon, one DMA buffer is already being written out.
If you have your period size too small, then the CPU doesn't have time to write audio out in the scheduled execution slot provided. Typically people aim for a minimum of 500 us or 1 ms in latency. If you are doing heavy forms of computation, then you may want to choose 5 ms or 10 ms of latency. You may choose even more latency if you are on a non-powerful embedded system.
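To make the relationship between period size and latency concrete, the duration of one period is the number of frames in it divided by the sample rate. A minimal sketch, where the function name periodLatencyMs is hypothetical and the 44.1 kHz rate in the example is an assumption (the question does not state a rate):

```cpp
// Duration of one period in milliseconds.
// frames: period size in frames; rateHz: sample rate in Hz.
double periodLatencyMs(unsigned long frames, unsigned int rateHz) {
    return 1000.0 * static_cast<double>(frames) / static_cast<double>(rateHz);
}
```

For example, periodLatencyMs(441, 44100) gives 10 ms. A blocking read that has to wait for a period to fill can be expected to take on that order of time, so the period size is a plausible place to look when some calls take much longer than others.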
If you want to push the limit of the system, then you can request the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One way to increase the priority, taken from the changeThreadPriority method of the gtkIOStream ALSA C++ OO classes, is:
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority){
    int ret;
    pthread_t thisThread = pthread_self(); // get the current thread
    struct sched_param origParams, params;
    int origPolicy, policy = SCHED_FIFO, newPolicy=0;

    if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams))!=0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);

    if (priority<0) //maximum priority
        params.sched_priority = sched_get_priority_max(policy);
    else
        params.sched_priority = priority;

    if (params.sched_priority>sched_get_priority_max(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
    if (params.sched_priority<sched_get_priority_min(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");

    if ((ret = pthread_setschedparam(thisThread, policy, &params))!=0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");

    if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params))!=0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");

    if(policy != newPolicy)
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");

    printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
    return 0;
}

What is the meaning of the CANBUS function mode initializing settings for STM32?

I want to understand the meaning of the following function mode definitions. There is an explanation in the library, but it is too short for me to understand, and I couldn't find any information about it on the net.
CAN_InitStructure.CAN_TTCM = DISABLE;
CAN_InitStructure.CAN_ABOM = DISABLE;
CAN_InitStructure.CAN_AWUM = DISABLE;
CAN_InitStructure.CAN_NART = ENABLE;
CAN_InitStructure.CAN_RFLM = DISABLE;
CAN_InitStructure.CAN_TXFP = ENABLE;
These are the names of the bits located in the CAN master control register (CAN_MCR). So, the proper source for their meaning is the reference manual. My following answer will be somewhat copy & paste from the reference manual, but I will try to explain these bits in detail.
TTCM (Time triggered communication mode): This bit activates the Time Triggered Communication (TTCAN) mode, which is an extension to the CAN standard. I don't know much about TTCAN, but as I understand, it assigns time windows to messages to satisfy some real-time requirements. So, normally this bit should remain 0.
ABOM (Automatic bus-off management): If the transmit error counter (TEC) becomes greater than 255, the CAN hardware switches to bus-off state. To recover, it must wait for the recovery sequence, 128 occurrences of 11 consecutive recessive bits. Only after that, the CAN hardware may return to the normal operating state. This bit controls the returning behavior. If it's 1, returning to normal state is automatic. Otherwise, software should make the request, provided that the recovery sequence has been observed.
AWUM (Automatic wakeup mode): The CAN module can be in one of 3 modes: Initialization mode, normal mode or sleep (low power) mode. Sleep mode is requested by the software. However, you have 2 options to exit sleep mode. If this bit is 0, then you have to exit sleep mode manually. You may enable CAN wakeup interrupt to inform you about bus activity, then exit the sleep mode in ISR. But if this bit is 1, the hardware returns to normal mode automatically when it detects bus activity.
NART (No automatic retransmission): Normally, CAN hardware retries to transmit a message if its previous attempts fail, because of arbitration lost etc. But if you make this bit 1, the transmitter does not retry. This is required when you use Time Triggered Communication (TTCAN). Otherwise, you should keep this bit 0.
RFLM (Receive FIFO locked mode): Your receive mailboxes are 3 levels deep, meaning that they can store a maximum of 3 messages before they overrun. This bit controls what happens in case of mailbox overrun. The default behavior is to keep the oldest 2 messages and the newest one. For example, if you received 5 messages, the buffer keeps messages 1, 2 & 5. However, if you make this bit 1, the mailbox keeps messages 1, 2 & 3 and discards the new arrivals.
TXFP (Transmit FIFO priority): You have 3 transmit mailboxes. When you fill more than one, the hardware must decide which one to transmit first. Normally, one can assume that a message with a lower ID number is more important and should be transmitted first. But if you want to transfer them in a first-comes-first-served fashion for some reason, you need to make this bit 1. Of course, this is just a local priority. On the physical bus, the messages with lower ID always have priority.
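The RFLM behavior described above can be illustrated with a small software model. This is only a sketch of the overrun policy as explained in this answer, not a model of the real peripheral; the names RxFifoModel and receiveSequence are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Depth of the receive FIFO, as described in the answer.
constexpr std::size_t kFifoDepth = 3;

// Simplified software model of a 3-deep receive FIFO (not the real hardware).
class RxFifoModel {
public:
    explicit RxFifoModel(bool locked) : locked_(locked) {}

    // Store an incoming message number, applying the overrun policy when full.
    void receive(int msg) {
        if (slots_.size() < kFifoDepth) {
            slots_.push_back(msg);
        } else if (!locked_) {
            slots_.back() = msg;  // RFLM = 0: the newest message overwrites the last slot
        }
        // RFLM = 1: the FIFO is locked, so the new arrival is discarded
    }

    const std::vector<int>& contents() const { return slots_; }

private:
    bool locked_;
    std::vector<int> slots_;
};

// Feed messages 1..count into a fresh model and return what it ends up holding.
std::vector<int> receiveSequence(bool locked, int count) {
    RxFifoModel fifo(locked);
    for (int msg = 1; msg <= count; ++msg) {
        fifo.receive(msg);
    }
    return fifo.contents();
}
```

With five incoming messages and nothing read out in between, receiveSequence(false, 5) holds 1, 2, 5, while receiveSequence(true, 5) holds 1, 2, 3, matching the two cases in the RFLM description.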

Variable Multithread Access - Corruption

In a nutshell:
I have one counter variable that is accessed from many threads. Although I've implemented multi-thread read/write protections, the variable seems to still -in an inconsistent way- get written to simultaneously, leading to incorrect results from the counter.
Getting into the weeds:
I'm using a "for loop" that triggers roughly 100 URL requests in the background, each in its “DispatchQueue.global(qos: .userInitiated).async” queue.
These requests are asynchronous; once they finish they update a “counter” variable. This variable is supposed to be multi-thread protected, meaning it’s always accessed from one thread and it’s accessed synchronously. However, something is wrong: from time to time the variable is accessed simultaneously by two threads, so the counter does not update correctly. Here's an example; let's imagine we have 5 URLs to fetch:
We start with the Counter variable at 5.
1 URL Request Finishes -> Counter = 4
2 URL Request Finishes -> Counter = 3
3 URL Request Finishes -> Counter = 2
4 URL Request Finishes (and for some reason – I assume variable is accessed at the same time) -> Counter 2
5 URL Request Finishes -> Counter = 1
As you can see, this leads to the counter being 1, instead of 0, which then affects other parts of the code. This error happens inconsistently.
Here is the multi-thread protection I use for the counter variable:
Dedicated Global Queue
//Background queue to synchronize data access
fileprivate let globalBackgroundSyncronizeDataQueue = DispatchQueue(label: "globalBackgroundSyncronizeSharedData")
Variable is always accessed via accessor:
var numberOfFeedsToFetch_Value: Int = 0
var numberOfFeedsToFetch: Int {
    set (newValue) {
        globalBackgroundSyncronizeDataQueue.sync() {
            self.numberOfFeedsToFetch_Value = newValue
        }
    }
    get {
        return globalBackgroundSyncronizeDataQueue.sync {
            numberOfFeedsToFetch_Value
        }
    }
}
I assume I may be missing something but I've used profiling and all seems to be good, also checked the documentation and I seem to be doing what they recommend. Really appreciate your help.
Thanks!!
Answer from Apple Forums: https://forums.developer.apple.com/message/322332#322332
The individual accessors are thread safe, but an increment operation
isn't atomic given how you've written the code. That is, while one
thread is getting or setting the value, no other threads can also be
getting or setting the value. However, there's nothing preventing
thread A from reading the current value (say, 2), thread B reading the
same current value (2), each thread adding one to this value in their
private temporary, and then each thread writing their incremented
value (3 for both threads) to the property. So, two threads
incremented but the property did not go from 2 to 4; it only went from
2 to 3. You need to do the whole increment operation (get, increment
the private value, set) in an atomic way such that no other thread can
read or write the property while it's in progress.
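The point of the quoted answer, that the whole get/modify/set must happen as one step, can be sketched as follows. The sketch is written in C++ rather than Swift and is not the original code; FeedCounter and finishAll are hypothetical names, and a mutex stands in for the serial dispatch queue. In the question's Swift code, the equivalent would be to do the whole decrement inside a single globalBackgroundSyncronizeDataQueue.sync block instead of a separate get followed by a set.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Counter whose whole read-modify-write happens under one lock.
class FeedCounter {
public:
    explicit FeedCounter(int start) : value_(start) {}

    // Get, decrement and set as one indivisible step; returns the new value.
    int decrement() {
        std::lock_guard<std::mutex> guard(mutex_);
        return --value_;
    }

    int get() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_;
};

// Start 'tasks' threads that each decrement the counter once,
// wait for all of them, and return the final value.
int finishAll(int tasks) {
    FeedCounter counter(tasks);
    std::vector<std::thread> workers;
    for (int i = 0; i < tasks; ++i) {
        workers.emplace_back([&counter] { counter.decrement(); });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.get();
}
```

Because no thread can read the value between another thread's read and write, finishAll(100) ends at 0 rather than occasionally stopping at 1.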

libspotify C sending zeros at the end of track

I'm using libspotify SDK, C library for win32.
I think my setup is right: every session callback is registered. I don't understand why I never receive the end_of_track call, while music_delivery keeps being called with zero-padded deliveries of 22050 frames.
To start playing, I first load the track with sp_session_load. As long as it returns SP_ERROR_IS_LOADING, I post a message to my message queue (my synchronization method, using the PostMessage win32 API) in order to call sp_session_load again. As soon as it returns SP_ERROR_OK, I call sp_session_play and music_delivery starts immediately, with correct frames.
I don't know why, at the end of the track, the libspotify runtime starts sending zero-padded frames instead of calling the end_of_track callback.
In other conditions it works perfectly: when I use an sp_track obtained from an album browse, the track is already fully loaded at the moment I load it into the current session for playing, and end_of_track is called correctly. In the failing case, I search for the track by its Spotify URI and get the results; here the track metadata is not yet ready at the play attempt, so I use that kind of "polling" on sp_session_load with PostMessage.
Can anybody help me?
I ran into the same problem and I think the issue was that I was consuming the data too fast without giving other threads time to do any work, since I was spending all of my time in the music_delivery callback. I found that if I add some throttling and notify the main thread that it can wake up to do some processing, the extra zeros at the end of the track are reduced to one delivery of 22,050 frames (or 500ms at 44.1kHz).
Here is an example of what I added to my callback, heavily borrowed from the jukebox.c example provided with the SDK:
/* Buffer 1 second of data, then notify the main thread to do some processing */
if (g_throttle > format->sample_rate) {
    pthread_mutex_lock(&g_notify_mutex);
    g_notify_do = 1;
    pthread_cond_signal(&g_notify_cond);
    pthread_mutex_unlock(&g_notify_mutex);

    // Reset the throttle counter
    g_throttle = 0;
    return 0;
}
As I said, there were still 22,050 frames of zeros delivered before the track stopped, but I believe libspotify may do this on purpose to ensure that the duration calculated from the number of frames received (song_duration_ms = total_frames_delivered / sample_rate * 1000) is greater than or equal to the duration reported by sp_track_duration. In my case, the track I was trying to stream was 172,000ms in duration; without the extra padding the calculated duration is 171,796ms, but with the padding it was 172,296ms.
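The duration arithmetic above can be checked with a small helper. This is a sketch with a hypothetical name, framesToMs. Note that the formula as written, total_frames_delivered / sample_rate * 1000, would truncate to whole seconds if evaluated in integer arithmetic, so the sketch multiplies before dividing.

```cpp
#include <cstdint>

// Duration in milliseconds of 'frames' audio frames at 'sampleRate' Hz.
// Multiplying before dividing keeps the sub-second part that plain
// integer division by the sample rate would drop.
std::int64_t framesToMs(std::int64_t frames, std::int64_t sampleRate) {
    return frames * 1000 / sampleRate;
}
```

One padding delivery of 22,050 frames at 44.1kHz is framesToMs(22050, 44100) = 500 ms, which accounts for the difference between 171,796ms and 172,296ms.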
Hope this helps.

Weird Winsock recv() slowdown

I'm writing a little VOIP app like Skype, which works quite well right now, but I've run into a very strange problem.
In one thread, inside a while(true) loop, I call the winsock recv() function twice per iteration to get data from a socket.
The first call gets 2 bytes, which are cast to a (short), while the second call gets the rest of the message, which looks like:
Complete Message: [2 Byte Header | Message, length determined by the 2Byte Header]
These packets arrive at roughly 49 per second, which comes to roughly 3000 bytes/sec.
The content of these packets is audio-data that gets converted into wave.
With ioctlsocket() I determine whether there is more data on the socket after each "message" I receive (2byte+data). If there is something on the socket right after I received a message within the while(true) loop of the thread, the message is received but thrown away, to keep latency from accumulating.
This concept works very well, but here's the problem:
While my VOIP program is running and I download a file in parallel (e.g. via browser), too much data always piles up on the socket, because the recv() loop actually seems to slow down while downloading. This happens in every download/upload situation besides the actual voip up/download.
I don't know where this behaviour comes from, but when I cancel every up/download besides the voip traffic of my application, my app works perfectly again.
If the program runs perfectly, the ioctlsocket() function writes 0 into the bytesLeft var, defined within the class where the receive function comes from.
Does somebody know where this comes from? I'll attach my receive function down below:
std::string D_SOCKETS::receive_message(){
    // read the 2-byte length header, then the message body
    recv(ClientSocket,(char*)&val,sizeof(val),MSG_WAITALL);
    receivedBytes = recv(ClientSocket,buffer,val,MSG_WAITALL);
    if (receivedBytes != val){
        printf("SHORT: %d PAKET: %d ERROR: %d",val,receivedBytes,WSAGetLastError());
        exit(128);
    }

    ioctlsocket(ClientSocket,FIONREAD,&bytesLeft);
    cout<<"Bytes left on the Socket:"<<bytesLeft<<endl;

    if(bytesLeft>20)
    {
        // message was received, but is ignored/thrown away
        return std::string();
    }
    else
        return std::string(buffer,receivedBytes);
}
There is no need to use ioctlsocket() to discard data. That would indicate a bug in your protocol design. Assuming you are using TCP (you did not say), there should not be any leftover data if your 2-byte header is always accurate. After reading the 2-byte header and then reading the specified number of bytes, the next bytes you receive constitute your next message and should not be discarded simply because they exist.
The fact that ioctlsocket() reports more bytes available means that you are receiving messages faster than you are reading them from the socket. Make your reading code run faster, don't throw away good data due to your slowness.
Your reading model is not efficient. Instead of reading 2 bytes, then X bytes, then 2 bytes, and so on, you should use a larger buffer to read more raw data from the socket at one time (use ioctlsocket() to know how many bytes are available, then read at least that many bytes at once and append them to the end of your buffer), and then parse as many complete messages as are in the buffer before reading more raw data from the socket again. The more data you can read at a time, the faster you can receive data.
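The parsing half of that advice might look like the following sketch. It leaves out the socket calls entirely and only shows how to pull complete [2-byte length | payload] messages out of a buffer of raw bytes. The names extractMessages and appendFramed are hypothetical, and it assumes an unsigned 16-bit length in host byte order (the question casts the header to a short without stating the byte order).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Extract every complete [2-byte length | payload] message from 'pending'.
// Consumed bytes are erased; an incomplete tail stays for the next call.
std::vector<std::string> extractMessages(std::vector<char>& pending) {
    std::vector<std::string> messages;
    std::size_t offset = 0;
    while (pending.size() - offset >= sizeof(std::uint16_t)) {
        std::uint16_t length;
        // Assumption: the header is an unsigned 16-bit value in host byte order.
        std::memcpy(&length, pending.data() + offset, sizeof(length));
        if (pending.size() - offset - sizeof(length) < length) {
            break;  // the payload has not fully arrived yet
        }
        messages.emplace_back(pending.data() + offset + sizeof(length), length);
        offset += sizeof(length) + length;
    }
    pending.erase(pending.begin(),
                  pending.begin() + static_cast<std::ptrdiff_t>(offset));
    return messages;
}

// Hypothetical helper for building input: append one framed message to 'out'.
void appendFramed(std::vector<char>& out, const std::string& payload) {
    std::uint16_t length = static_cast<std::uint16_t>(payload.size());
    char header[sizeof(length)];
    std::memcpy(header, &length, sizeof(length));
    out.insert(out.end(), header, header + sizeof(length));
    out.insert(out.end(), payload.begin(), payload.end());
}
```

The reading loop would append whatever recv() returns to the pending buffer and then call extractMessages(); bytes belonging to a partially received message simply wait in the buffer until the rest arrives.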
To help speed up the code even more, don't process the messages inside the loop directly, either. Do the processing in another thread instead. Have the reading loop put complete messages in a queue and go back to reading, and then have a processing thread pull from the queue whenever messages are available for processing.
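A minimal sketch of such a hand-off between the reading loop and a processing thread, using only the C++11 standard library. MessageQueue and processInWorker are hypothetical names, and the "processing" here just collects the messages so the flow is visible.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue of complete messages.
class MessageQueue {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push_back(std::move(msg));
        }
        ready_.notify_one();
    }

    // Signal that no more messages will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a message is available.
    // Returns false once the queue is closed and fully drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) {
            return false;
        }
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> items_;
    bool closed_ = false;
};

// The calling thread plays the reading loop and only enqueues;
// a worker thread does the (here trivial) processing.
std::vector<std::string> processInWorker(const std::vector<std::string>& incoming) {
    MessageQueue queue;
    std::vector<std::string> processed;
    std::thread worker([&] {
        std::string msg;
        while (queue.pop(msg)) {
            processed.push_back(msg);
        }
    });
    for (const auto& msg : incoming) {
        queue.push(msg);
    }
    queue.close();
    worker.join();
    return processed;
}
```

With a single worker the messages are processed in arrival order, and the reading thread never waits for the processing to finish before going back to the socket.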