UDP server consuming high CPU - sockets

I am observing high CPU usage in my UDP server implementation, which runs an infinite loop expecting fifteen 1.5 KB packets every millisecond. It looks like this:
struct RecvContext
{
    enum { BufferSize = 1600 };

    RecvContext()
    {
        senderSockAddrLen = sizeof(sockaddr_storage);
        memset(&overlapped, 0, sizeof(OVERLAPPED));
        overlapped.hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
        memset(&sendersSockAddr, 0, sizeof(sockaddr_storage));
        buffer.clear();
        buffer.resize(BufferSize);
        wsabuf.buf = (char*)buffer.data();
        wsabuf.len = ULONG(buffer.size());
    }

    void CloseEventHandle()
    {
        if (overlapped.hEvent != INVALID_HANDLE_VALUE)
        {
            CloseHandle(overlapped.hEvent);
            overlapped.hEvent = INVALID_HANDLE_VALUE;
        }
    }

    OVERLAPPED overlapped;
    int senderSockAddrLen;
    sockaddr_storage sendersSockAddr;
    std::vector<uint8_t> buffer;
    WSABUF wsabuf;
};

void Receive()
{
    DWORD flags = 0, bytesRecv = 0;
    SOCKET sockHandle = /* ... */;
    while (/* stopping condition */)
    {
        std::shared_ptr<RecvContext> _recvContext = std::make_shared<RecvContext>();
        if (SOCKET_ERROR == WSARecvFrom(sockHandle, &_recvContext->wsabuf, 1, nullptr, &flags,
                                        (sockaddr*)&_recvContext->sendersSockAddr,
                                        (LPINT)&_recvContext->senderSockAddrLen,
                                        &_recvContext->overlapped, nullptr))
        {
            if (WSAGetLastError() != WSA_IO_PENDING)
            {
                // error
            }
            else
            {
                if (WSA_WAIT_FAILED == WSAWaitForMultipleEvents(1, &_recvContext->overlapped.hEvent,
                                                                FALSE, INFINITE, FALSE))
                {
                    // error
                }
                if (!WSAGetOverlappedResult(sockHandle, &_recvContext->overlapped, &bytesRecv,
                                            FALSE, &flags))
                {
                    // error
                }
            }
        }
        _recvContext->CloseEventHandle();
        // async task to process _recvContext->buffer
    }
}
The CPU consumption of this UDP server is very high, even when the packets are not processed after receipt. How can the CPU consumption be reduced here?

You've chosen about the most inefficient combination of mechanisms imaginable.
Why use overlapped I/O if you're only going to pend one operation and then wait for it to complete?
Why use an event, which is about the slowest notification scheme Windows has?
Why do you only pend one operation at a time? You're forcing the implementation to stash datagrams in its own buffers and then copy them into yours.
Why do you post the receive operation right before you're going to wait for it to complete rather than right after the previous one completes?
Why do you create a new receive context each time instead of re-using the existing buffer, event, and so on?
Use IOCP. Windows events are very slow and heavy.
Post lots of operations. You want the operating system to be able to put the datagram right in your buffer rather than having to allocate another buffer that it copies data into and out of.
Re-use your buffers and allocate all your receive buffers from a contiguous pool rather than fragmenting them throughout process memory. The memory used for your buffers has to be pinned and you want to minimize the amount of pinning needed.
Re-post operations as soon as they complete. Don't process them and then re-post. There's no reason to delay starting the operation. You can probably ignore this if you followed all the other suggestions because you wouldn't have a "spare" buffer to post anyway.
Alternatively, you can probably get away with having a thread that spins on a blocking receive operation. Just make sure your code has a loop that is as tight as possible, posting a different (already-allocated) buffer as soon as it returns after dispatching another thread to process the buffer it just filled with the receive operation.
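Putting those points together, a receive path might look something like the sketch below. It is illustrative rather than production code: the names Ctx, PostRecv, kPendingRecvs, and kBufferSize are mine, error handling and shutdown are elided, and the hand-off to processing is only indicated by a comment.
#include <winsock2.h>
#include <windows.h>
#include <cstring>
#include <vector>

struct Ctx
{
    OVERLAPPED       overlapped;   // must stay alive while the recv is pending
    WSABUF           wsabuf;
    sockaddr_storage from;
    INT              fromLen;
};

void PostRecv(SOCKET s, Ctx& c)
{
    DWORD flags = 0;
    std::memset(&c.overlapped, 0, sizeof(c.overlapped));
    c.fromLen = sizeof(c.from);
    // Returns immediately; the completion is delivered through the IOCP.
    if (WSARecvFrom(s, &c.wsabuf, 1, nullptr, &flags,
                    (sockaddr*)&c.from, &c.fromLen, &c.overlapped, nullptr) == SOCKET_ERROR
        && WSAGetLastError() != WSA_IO_PENDING)
    {
        // handle the error
    }
}

void Receive(SOCKET s)
{
    const int kPendingRecvs = 64;   // plenty of buffers for the OS to fill directly
    const int kBufferSize   = 1600;

    HANDLE iocp = CreateIoCompletionPort((HANDLE)s, nullptr, 0, 0);

    std::vector<char> pool(kPendingRecvs * kBufferSize); // one contiguous region to pin
    std::vector<Ctx>  ctxs(kPendingRecvs);
    for (int i = 0; i < kPendingRecvs; ++i)
    {
        ctxs[i].wsabuf.buf = pool.data() + i * kBufferSize;
        ctxs[i].wsabuf.len = kBufferSize;
        PostRecv(s, ctxs[i]);
    }

    for (;;) // stopping condition elided
    {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* ov = nullptr;
        // Sleeps (no CPU) until a datagram has landed in one of our buffers.
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
        {
            if (ov == nullptr) break; // the wait itself failed / port closed
            // otherwise an individual operation failed; fall through and re-post
        }
        Ctx* c = CONTAINING_RECORD(ov, Ctx, overlapped);
        // Hand c->wsabuf.buf[0 .. bytes) to a processing task (copy or swap the
        // buffer out first), then immediately give the buffer back to the OS:
        PostRecv(s, *c);
    }
}
The loop now sleeps in GetQueuedCompletionStatus instead of burning a core, the OS always has spare buffers it can fill directly, and each buffer is re-posted the moment its completion has been handled.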

Related

STM32 HAL UART receive by interrupt cleaning buffer

I'm working on an application where I process commands of fixed length received via UART.
I'm also using FreeRTOS, and the task that handles the incoming commands is suspended until the UART interrupt handler is called, so my code looks like this:
void USART1_IRQHandler()
{
    HAL_UART_IRQHandler(&huart1);
}

void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart)
{
    HAL_UART_Receive_IT(&huart1, uart_rx_buf, CMD_LEN);
}

void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{
    BaseType_t higherTaskReady = pdFALSE;
    HAL_UART_Receive_IT(&huart1, uart_rx_buf, CMD_LEN);  // restart interrupt handler
    xSemaphoreGiveFromISR(uart_mutex, &higherTaskReady); // release the semaphore
    portYIELD_FROM_ISR(higherTaskReady);
}
I am using the ErrorCallback in case an overflow occurs. Right now I successfully catch every correct command, even if the commands are issued char by char.
However, I'm trying to make the system more error-proof by considering the case where more characters are received than expected.
The command length is 4, but if I receive, for example, 5 chars, the first 4 are processed normally; the next command then starts from the last unprocessed char, so another 3 chars are needed before I can process commands correctly again.
Luckily, the ErrorCallback is called whenever I receive more than 4 chars, so I know when this happens, but I need a robust way of cleaning the UART buffer so the stale chars are gone.
One solution I can think of is receiving 1 char at a time until no more can be received, but is there a better way to simply flush the buffer?
Yes, the problem is the lack of a delimiter: every byte can carry a payload value from 0 to 255, so how can you detect the inconsistency?
My solution is a checksum byte in the protocol. If the checksum fails, a blocking-mode HAL_UART_Receive call is used to move the rest of the data from the "system buffer" into a "disposable buffer". In my example the fixed size of the protocol is 6 bytes, I use USART6, and I have a global variable RxBuffer. Here is the code:
void HAL_UART_RxCpltCallback(UART_HandleTypeDef *UartHandle)
{
    if (UartHandle->Instance == USART6) {
        if (your_checksum_is_ok) {
            // You can process the incoming data
        } else {
            uint8_t TempBuffer;
            HAL_StatusTypeDef hal_status;
            do {
                hal_status = HAL_UART_Receive(&huart6, &TempBuffer, 1, 10);
            } while (hal_status != HAL_TIMEOUT);
        }
        HAL_UART_Receive_IT(&huart6, (uint8_t*)RxBuffer, 6);
    }
}

void HAL_UART_ErrorCallback(UART_HandleTypeDef *UartHandle)
{
    if (UartHandle->Instance == USART6) {
        HAL_UART_Receive_IT(&huart6, (uint8_t*)RxBuffer, 6);
    }
}
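For completeness, the your_checksum_is_ok test could look something like the sketch below. This assumes a 6-byte frame whose last byte is the two's-complement sum of the first five, so a valid frame sums to zero mod 256; substitute whatever scheme your protocol actually uses.
/* Hypothetical checksum check for a 6-byte frame: the sender stores the
   negated sum of bytes 0..4 in byte 5, so a valid frame sums to 0 (mod 256). */
static int checksum_ok(const uint8_t *frame)
{
    uint8_t sum = 0;
    for (int i = 0; i < 6; i++)
        sum += frame[i];
    return sum == 0;
}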

Socket read often returns -1 while the buffer is not empty

I am trying to test WiFi data transfer between a cell phone and an ESP32 (Arduino). When the ESP32 reads file data via WiFi, client.read() often returns -1 even though there is still data to come, so I have to add extra conditions to check whether reading has finished.
My question is why there are so many failed reads; any ideas are highly appreciated.
void setup()
{
    i = 0;
    Serial.begin(115200);
    Serial.println("begin...");
    // You can remove the password parameter if you want the AP to be open.
    WiFi.softAP(ssid, password);
    IPAddress myIP = WiFi.softAPIP();
    Serial.print("AP IP address: ");
    Serial.println(myIP);
    server.begin();
    Serial.println("Server started");
}

// the loop function runs over and over again until power down or reset
void loop()
{
    WiFiClient client = server.available();  // listen for incoming clients
    if (client)                              // if you get a client,
    {
        Serial.println("New Client.");       // print a message out the serial port
        Serial.println(client.remoteIP().toString());
        while (client.connected())           // loop while the client's connected
        {
            while (client.available() > 0)   // if there are bytes to read from the client,
            {
                char c = client.read();      // read a byte, then
                if (DOWNLOADFILE == c) {
                    pretime = millis();
                    uint8_t filename[32] = {0};
                    uint8_t bFilesize[8];
                    long filesize;
                    int segment = 0;
                    int remainder = 0;
                    uint8_t data[512];
                    int len = 0;
                    int totallen = 0;
                    delay(50);
                    len = client.read(filename, 32);
                    delay(50);
                    len = client.read(bFilesize, 8);
                    filesize = BytesToLong(bFilesize);
                    segment = (int)filesize / 512;
                    delay(50);
                    i = 0; // succeed times
                    j = 0; // fail times

                    // The problem occurs here: too many "-1" return values.
                    // Total read 24941639 bytes, succeeded 49725 times, failed 278348 times.
                    // If there were no read problems, it should only need 48,715 reads to finish.
                    // But it read 328,073 times in total, including 278,348 failed reads,
                    // which wasted too much time.
                    while (((len = client.read(data, 512)) != -1) || (totallen < filesize))
                    {
                        if (len > -1) {
                            totallen += len;
                            i++;
                        }
                        else {
                            j++;
                        }
                    }
                    // loop read ends: too many failed reads

                    sprintf(toClient, "\nfile name %s, size %d, total read %d, segment %d, succeed %d times, failed %d times\n",
                            filename, filesize, totallen, segment, i, j);
                    Serial.write(toClient);
                    curtime = millis();
                    sprintf(toClient, "time elapsed %d ms, speed %d Bps\n",
                            curtime - pretime, filesize * 1000 / (curtime - pretime));
                    Serial.write(toClient);
                    client.write(RETSUCCESS);
                }
                else
                {
                    Serial.write("Unknown command\n");
                }
            }
        }
        // close the connection:
        client.stop();
        Serial.println("Client Disconnected.");
    }
}
When you call available() and check for > 0, you are checking whether there is one or more characters available to read. It will be true if just one character has arrived. You read one character, which is fine, but then you start reading more without stopping to check whether more are available.
TCP doesn't guarantee that if you write 100 characters to a socket, they all arrive at once. They can arrive in arbitrary "chunks" with arbitrary delays. All that's guaranteed is that they will eventually arrive in order (or, if that's not possible because of networking issues, the connection will fail).
In the absence of a blocking read function (I don't know if one exists), you have to do something like what you are doing: read one character at a time, append it to a buffer, and gracefully handle the possibility of getting a -1 (the next character isn't here yet, or the connection broke). In general you never want to read multiple characters in a single read(buf, len) unless you've just used available() to make sure len characters are actually available. And even that can fail if your buffers are really large. Stick to one character at a time.
It's a reasonable idea to call delay(1) when available() returns 0. In the places where you try to guess at something like delay(20) before reading a buffer you are rolling the dice - there's no promise that any amount of delay will guarantee bytes get delivered. Example: Maybe a drop of water fell on the chip's antenna and it won't work until the drop evaporates. Data could be delayed for minutes.
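Putting those two points together, a receive helper in this style might look like the sketch below. It's only an illustration of the pattern; the readExact name and the timeout policy are mine, not part of the Arduino API.
#include <WiFi.h>

// Read exactly `want` bytes, one at a time, tolerating -1 ("no data yet").
// Returns false on disconnect or timeout. A sketch, not tested code.
bool readExact(WiFiClient &client, uint8_t *buf, size_t want, uint32_t timeoutMs)
{
    size_t got = 0;
    uint32_t start = millis();
    while (got < want)
    {
        if (!client.connected()) return false;  // connection dropped
        if (client.available() > 0)
        {
            int c = client.read();              // one byte, or -1
            if (c >= 0) buf[got++] = (uint8_t)c;
        }
        else
        {
            if (millis() - start > timeoutMs) return false;
            delay(1);                           // yield instead of spinning
        }
    }
    return true;
}
With something like this, the delay(50) guesses and the ad-hoc counting of -1 returns in the question's loop become unnecessary.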
I don't know how available() behaves if the connection fails. You might have to do a read() and get back a -1 to diagnose a failed connection. The Arduino documentation is absolutely horrible, so you'll have to experiment.
TCP is much simpler to handle on platforms that have threads, blocking read, select() and other tools to manage data. Having only non-blocking read makes things harder, but there it is.
In some situations UDP is actually a lot simpler - there are more guarantees about getting messages of certain sizes in a single chunk. But of course whole messages can go missing or show up out of order. It's a trade-off.

Play sounds synchronously using snd_pcm_writei

I need to play sounds upon certain events and want to minimize processor load, because some image processing is being done too and processor performance is limited.
For the present, I play only one sound at a time, and I do it as follows:
At program startup, sounds are read from .wav files and the raw PCM data are loaded into memory.
A sound device is opened (snd_pcm_open() in mode SND_PCM_NONBLOCK).
A worker thread is started which continuously calls snd_pcm_writei() as long as it is fed with data (data->remaining > 0).
Somewhat condensed, the worker thread function is:
static void *Thread_Func (void *arg)
{
    thrdata_t *data = (thrdata_t *)arg;
    snd_pcm_sframes_t res;
    while (1)
    {   pthread_mutex_lock (&lock);
        if (data->shall_stop)
        {   data->shall_stop = false;
            snd_pcm_drop (data->pcm_device);
            snd_pcm_prepare (data->pcm_device);
            data->remaining = 0;
        }
        if (data->remaining > 0)
        {   res = snd_pcm_writei (data->pcm_device, data->bufptr, data->remaining);
            if (res == -EAGAIN)
            {   pthread_mutex_unlock (&lock);  // don't hold the lock while retrying
                continue;
            }
            if (res < 0) // error
            {   fprintf (stderr, "snd_pcm_writei() error: %s\n", snd_strerror (res));
                snd_pcm_recover (data->pcm_device, res, 0);
            }
            else // another chunk has been handed over to sound hw
            {   data->bufptr += res * bytes_per_frame;
                data->remaining -= res;
            }
            if (data->remaining == 0) snd_pcm_prepare (data->pcm_device);
        }
        pthread_mutex_unlock (&lock);
        usleep (sleep_us); // processor relief
    }
} // Thread_Func
OK, so this works well for one sound at a time. How do I play several at once?
I found dmix, but it seems to be a user-level tool for mixing streams that come from separate programs.
Furthermore, I found the Simple Mixer Interface in the ALSA Project C Library Interface, without any hint, example, or tutorial on how to use all these functions, each described by a single line of text.
As a last resort I could calculate the mean value of all the buffers to be played simultaneously. So far I've been avoiding that, hoping that an ALSA solution might use sound hardware resources, thus relieving the main processor.
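(Concretely, that software fallback would be something like the following sketch, assuming S16_LE samples; mix_s16 is a name I made up. Summing with saturation, rather than taking the plain mean, avoids halving the volume when only one sound is active.)
/* Mix two equally long S16_LE buffers into one before a single
   snd_pcm_writei() call. Sketch only: channel layout, differing
   lengths, and more than two simultaneous sounds are not handled. */
#include <stdint.h>
#include <stddef.h>

static void mix_s16 (const int16_t *a, const int16_t *b,
                     int16_t *out, size_t nsamples)
{
    for (size_t i = 0; i < nsamples; i++)
    {   int32_t s = (int32_t)a[i] + (int32_t)b[i]; /* widen to avoid overflow */
        if (s >  32767) s =  32767;                /* saturate instead of wrapping */
        if (s < -32768) s = -32768;
        out[i] = (int16_t)s;
    }
}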
I'd be thankful for any hint about how to continue.

process Swift DispatchQueue without affecting resource

I have a Swift DispatchQueue that receives data at 60fps.
However, depending on the phone and the amount of data received, processing that data at 60fps becomes too expensive. In actuality, it is okay to process only half of it, or as much as the computation resources allow.
let queue = DispatchQueue(label: "com.test.dataprocessing")

func processData(data: SomeData) {
    queue.async {
        // data processing
    }
}
Does DispatchQueue somehow allow me to drop some data if a resource is limited? Currently, it is affecting the main UI of SceneKit. Or, is there something better than DispatchQueue for this type of task?
There are a couple of possible approaches:
The simple solution is to keep track of your own Bool indicating whether a task is in progress, and when more data comes in, only process it if one isn't already running:
private var inProgress = false
private var syncQueue = DispatchQueue(label: Bundle.main.bundleIdentifier! + ".sync.progress") // for reasons beyond the scope of this question, reader-writer with concurrent queue is not appropriate here

func processData(data: SomeData) {
    let isAlreadyRunning = syncQueue.sync { () -> Bool in
        if self.inProgress { return true }
        self.inProgress = true
        return false
    }
    if isAlreadyRunning { return }

    processQueue.async {  // processQueue: whatever queue performs the actual processing
        defer {
            self.syncQueue.async { self.inProgress = false }
        }
        // process `data`
    }
}
All of that syncQueue stuff is there to make sure I have thread-safe access to the inProgress property. But don't get lost in those details; use whatever synchronization mechanism you want (e.g., a lock). All we want to ensure is thread-safe access to the Bool status flag.
Focus on the basic idea, that we'll keep track of a Bool flag to know whether the processing queue is still tied up processing the prior set of SomeData. If it is busy, return immediately and don't process this new data. Otherwise, go ahead and process it.
While the above approach is conceptually simple, it won't offer great performance. For example, if processing a chunk of data always takes 0.02 seconds (50 per second) and your input data arrives 60 times per second, you'll end up with only 30 of them processed per second, because every input that arrives while the previous one is still being processed simply gets dropped.
A more sophisticated approach is to use a GCD user data source, something that says "run the following closure when the destination queue is free". And the beauty of these dispatch user data sources is that it will coalesce them together. These data sources are useful for decoupling the speed of inputs from the processing of them.
So, you first create a data source that simply indicates what should be done when data comes in:
private var dataToProcess: SomeData?
private lazy var source = DispatchSource.makeUserDataAddSource(queue: processQueue)

func configure() {
    source.setEventHandler { [unowned self] in
        guard let data = self.syncQueue.sync(execute: { self.dataToProcess }) else { return }
        // process `data`
    }
    source.resume()
}
So, when there's data to process, we update our synchronized dataToProcess property and then tell the data source that there is something to process:
func processData(data: SomeData) {
    syncQueue.async { self.dataToProcess = data }
    source.add(data: 1)
}
Again, just like the previous example, we're using syncQueue to synchronize access to a property across multiple threads. But this time we're synchronizing dataToProcess rather than the inProgress state variable of the first example. The idea is the same: we must be careful to synchronize our interaction with a property across multiple threads.
Anyway, using this pattern with the above scenario (input coming in at 60 fps, whereas processing can only handle 50 per second), the resulting performance is much closer to the theoretical max of 50 fps (I got between 42 and 48 fps, depending upon the queue priority), rather than 30 fps.
The latter approach can conceivably get more frames (or whatever you're processing) handled per second, with less idle time on the processing queue. Comparing the two alternatives: the former approach loses every other frame of data, whereas the latter only loses a frame when two separate sets of input data arrive before the processing queue becomes free, in which case they are coalesced into a single call to the dispatch source.

Async sockets in D

Okay, this is my first question here on Stack Overflow, so bear with me if I'm not asking properly.
Basically, I'm trying to code some asynchronous sockets using std.socket, but I'm not sure whether I've understood the concept correctly. I've only ever worked with asynchronous sockets in C#, and in D they seem to be at a much lower level. I've researched a lot and looked at plenty of code and documentation, both for D and for C/C++, to get an understanding, but I'm still not sure, so I'd appreciate examples if any of you have some. I tried looking at splat, but it's very outdated, and vibe seems too complex for just a simple asynchronous socket wrapper.
If I understood correctly, there is no poll() function in std.socket, so you'd have to use a SocketSet with a single socket on select() to poll the status of the socket, right?
So basically, the way I'd go about handling the sockets is to poll for the read status of the socket; on success (value > 0) I can call receive(), which returns 0 on disconnection and otherwise the number of bytes received, and I'd have to keep doing this until the expected number of bytes has been received.
Of course the socket is set to non-blocking!
Is that correct?
Here is the code I've made up so far.
void HANDLE_READ()
{
    while (true)
    {
        synchronized
        {
            auto events = cast(AsyncObject[int])ASYNC_EVENTS_READ;
            foreach (asyncObject; events)
            {
                int poll = pollRecv(asyncObject.socket.m_socket);
                switch (poll)
                {
                    case 0:
                        throw new SocketException("The socket had a time out!");
                    default:
                        if (poll <= -1)
                            throw new SocketException("The socket was interrupted!");

                        auto recvGetSize = asyncObject.socket.m_readBuffer.length - asyncObject.socket.readSize;
                        ubyte[] recvBuffer = new ubyte[recvGetSize];
                        auto recv = asyncObject.socket.m_socket.receive(recvBuffer);
                        if (recv == 0)
                        {
                            removeAsyncObject(asyncObject.event_id, true);
                            asyncObject.socket.disconnect();
                            continue;
                        }
                        if (recv == Socket.ERROR)
                            continue; // would block; try again on the next pass

                        asyncObject.socket.m_readBuffer ~= recvBuffer[0 .. recv];
                        asyncObject.socket.readSize += cast(int)recv;
                        if (asyncObject.socket.readSize == asyncObject.socket.expectedReadSize)
                        {
                            removeAsyncObject(asyncObject.event_id, true);
                            asyncObject.event(asyncObject.socket);
                        }
                        break;
                }
            }
        }
    }
}
So basically how I'd go about handling the sockets is polling to get the read status of the socket
Not quite right. Usually, the idea is to build an event loop around select, so that your application is idle as long as there are no network or timer events that need to be handled. With polling, you'd have to check for new events continuously or on a timer, which leads to wasted CPU cycles, and events getting handled a bit later than they occur.
In the event loop, you populate the SocketSets with sockets whose events you are interested in. If you want to be notified of new received data on a socket, it goes to the "readable" set. If you have data to send, the socket should be in the "writable" set. And all sockets should be on the "error" set.
select will then block (sleep) until an event comes in, and fill the SocketSets with the sockets which have actionable events. Your application can then respond to them appropriately: receive data for readable sockets, send queued data for writable sockets, and perform cleanup for errored sockets.
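In rough outline, it looks like the sketch below, shown here with the plain BSD select() API that D's SocketSet and Socket.select wrap; the D version is structurally the same. Error handling, the writable set, and client-list cleanup are elided.
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void eventLoop(int listener, std::vector<int> &clients)
{
    for (;;)
    {
        fd_set readSet, errorSet;
        FD_ZERO(&readSet);
        FD_ZERO(&errorSet);
        FD_SET(listener, &readSet);
        int maxFd = listener;
        for (int fd : clients)
        {
            FD_SET(fd, &readSet);   // we want to know when data arrives
            FD_SET(fd, &errorSet);  // and when something goes wrong
            if (fd > maxFd) maxFd = fd;
        }

        // Sleeps (using no CPU) until at least one socket has an event.
        if (select(maxFd + 1, &readSet, nullptr, &errorSet, nullptr) < 0)
            break;

        if (FD_ISSET(listener, &readSet))
            clients.push_back(accept(listener, nullptr, nullptr));

        for (int fd : clients)
        {
            if (!FD_ISSET(fd, &readSet))
                continue;
            char buf[4096];
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n <= 0) { /* disconnected: close(fd) and remove it from clients */ }
            else        { /* append buf[0 .. n) to this socket's read buffer */ }
        }
    }
}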
Here's my D implementation of non-fiber event-based networking: ae.net.asockets.