High CPU and Memory Consumption when using boost::asio async_read_some

I have made a server that reads data from clients using boost::asio's async_read_some. I have one handler function, and _ioService->poll() runs the event processing loop to execute ready handlers. In the handler _handleAsyncReceive I deallocate the buf that was allocated in receiveDataAsync. bufferSize is 500.
Code is as follows:
bool
TCPSocket::receiveDataAsync( unsigned int bufferSize )
{
    char *buf = new char[bufferSize + 1];
    try
    {
        _tcpSocket->async_read_some( boost::asio::buffer( (void*)buf, bufferSize ),
                                     boost::bind(&TCPSocket::_handleAsyncReceive,
                                                 this,
                                                 buf,
                                                 boost::asio::placeholders::error,
                                                 boost::asio::placeholders::bytes_transferred) );
        _ioService->poll();
    }
    catch (std::exception& e)
    {
        LOG_ERROR("Error Receiving Data Asynchronously");
        LOG_ERROR( e.what() );
        delete [] buf;
        return false;
    }
    // we don't delete buf here as it will be deleted by the callback _handleAsyncReceive
    return true;
}
void
TCPSocket::_handleAsyncReceive(char *buf, const boost::system::error_code& ec, size_t size)
{
    if(ec)
    {
        LOG_ERROR ("Error occurred while receiving data Asynchronously.");
        LOG_ERROR ( ec.message() );
    }
    else if ( size > 0 )
    {
        buf[size] = '\0';
        LOG_DEBUG("Deleting Buffer");
        emit _asyncDataReceivedSignal( QString::fromLocal8Bit( buf ) );
    }
    delete [] buf;
}
The problem is that buffers are allocated much faster than they are deallocated, so memory usage grows until it consumes all available memory and the system becomes unresponsive. CPU usage also sits at around 90%. How can I reduce the memory and CPU consumption?

You have a memory leak. io_service::poll() does not guarantee that it will dispatch your _handleAsyncReceive; it can dispatch another event (e.g. an accept), in which case the memory at char *buf is lost. My guess is that you are calling receiveDataAsync from a loop, but that isn't necessary: the leak will exist in any case (just at a different rate).
It's better to follow the asio examples and work with the suggested patterns rather than inventing your own.

You might consider using a wrap around buffer, which is also called a circular buffer. Boost has a template circular buffer version available. You can read about it here. The idea behind it is that when it becomes full, it circles around to the beginning where it will store things. You can do the same thing with other structures or arrays as well. For example, I currently use a byte array for this purpose in my application.
The advantage of using a dedicated large circular buffer to hold your messages is that you don't have to worry about creating and deleting memory for each new message that comes in. This avoids fragmentation of memory, which could become a problem.
To determine an appropriate size for the circular buffer, think about the maximum number of messages that can come in and be in some stage of processing simultaneously; multiply that number by the average message size, then multiply by a fudge factor of perhaps 1.5. The average message size for my application is under 100 bytes. My buffer size is 1 megabyte, which allows at least 10,000 messages to accumulate without affecting the wrap around buffer. But if more than 10,000 messages did accumulate without being completely processed, the circular buffer would be unusable and the program would have to be restarted. I have been thinking about reducing the buffer size because the system would probably be dead long before it hit the 10,000 message mark.
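To illustrate the idea, here is a minimal sketch using boost::circular_buffer (the tiny capacity is purely for demonstration); once full, it overwrites its oldest elements:
#include <boost/circular_buffer.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::circular_buffer<char> ring(8); // deliberately tiny capacity for the demo

    std::string msg = "abcdefghij"; // 10 bytes: two more than the capacity
    for (char c : msg)
        ring.push_back(c); // once full, push_back overwrites the oldest byte

    // Only the most recent 8 bytes remain: "cdefghij"
    for (char c : ring)
        std::cout << c;
    std::cout << '\n';
}
In a real receive path you would push incoming bytes at the write end and consume parsed messages from the read end, sizing the capacity as described above.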

As PSIAlt suggests, consider following the Boost.Asio examples and building upon their patterns for asynchronous programming.
Nevertheless, I would suggest considering whether multiple read calls need to be queued on the same socket. If the application only allows a single read operation to be pending on the socket, then resource usage is reduced:
There is no longer a scenario where an excessive number of handlers are pending in the io_service.
A single buffer can be preallocated and reused for each read operation. For example, the following asynchronous call chain only requires a single buffer, and allows an asynchronous read operation to be started while the previous data is being emitted on the Qt signal, as QString performs a deep copy.
TCPSocket::start()
{
receiveDataAsync(...) --.
} |
.---------------'
| .-----------------------------------.
v v |
TCPSocket::receiveDataAsync(...) |
{ |
_tcpSocket->async_read_some(_buffer); --. |
} | |
.-------------------------------' |
v |
TCPSocket::_handleAsyncReceive(...) |
{ |
QString data = QString::fromLocal8Bit(_buffer); |
receiveDataAsync(...); --------------------------'
emit _asyncDataReceivedSignal(data);
}
...
tcp_socket.start();
io_service.run();
It is important to identify when and where the io_service's event loop will be serviced. Generally, applications are designed so that the io_service does not run out of work, and the processing thread is simply waiting for events to occur. Thus, it is fairly common to start setting up asynchronous chains, then process the io_service event loop at a much higher scope.
On the other hand, if it is determined that TCPSocket::receiveDataAsync() should process the event loop in a blocking manner, then consider using synchronous operations.
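For concreteness, a minimal sketch of that chain using the question's naming; the member layout, the 512-byte buffer size, and the Qt wiring are assumptions, not a drop-in implementation:
#include <boost/asio.hpp>
#include <boost/array.hpp>
#include <boost/bind.hpp>
#include <QString>

class TCPSocket /* : public QObject, declaring the signal from the question */
{
public:
    void receiveDataAsync()
    {
        _tcpSocket->async_read_some(
            boost::asio::buffer(_buffer),
            boost::bind(&TCPSocket::_handleAsyncReceive, this,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

private:
    void _handleAsyncReceive(const boost::system::error_code& ec, std::size_t size)
    {
        if (ec)
            return; // a real implementation would log the error here

        QString data = QString::fromLocal8Bit(_buffer.data(), static_cast<int>(size));
        receiveDataAsync();                  // start the next read first...
        emit _asyncDataReceivedSignal(data); // ...then emit; data is already a deep copy
    }

    boost::asio::ip::tcp::socket* _tcpSocket; // owned elsewhere, as in the question
    boost::array<char, 512> _buffer;          // the single reusable buffer
};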

Related

MPU-6050 Burst Read Auto Increment

I'm trying to write a driver for the MPU-6050 and I'm stuck on how to read the raw accelerometer/gyroscope/temperature values. For instance, the MPU-6050 has the accelerometer X reading in 2 registers: ACCEL_XOUT[15:8] at address 0x3B and ACCEL_XOUT[7:0] at address 0x3C. Of course, to read the raw value I need to read both registers and put them together.
BUT
In the description of the registers (in the register map and description sheet, https://invensense.tdk.com/wp-content/uploads/2015/02/MPU-6000-Register-Map1.pdf) it says that to guarantee readings from the same sampling instant I must use burst reads, because as soon as an idle I2C bus is detected, the sensor registers are refreshed with new data from a new sampling instant. The datasheet snippet (an I2C burst read timing diagram, not reproduced here) shows the simple I2C burst read.
However, this approach (to the best of my understanding) would only read the ACCEL_X registers from the same sampling instant if auto-increment were supported (such that the first DATA in that sequence would be from ACCEL_XOUT[15:8] at address 0x3B and the second DATA from ACCEL_XOUT[7:0] at address 0x3C). But the datasheet (https://invensense.tdk.com/wp-content/uploads/2015/02/MPU-6000-Datasheet1.pdf) only mentions that I2C burst writes support the auto-increment feature. Without auto-increment on the I2C read side, how would I go about reading two different registers while maintaining the same sampling instant?
I also recognize that I could use the sensor's FIFO feature or the interrupt to accomplish what I'm after, but (for my own curiosity) I would like a solution that didn't rely on either.
I have the same problem; it looks like the documentation on this topic is incomplete.
Reading single sample
I think you can burst read the ACCEL_*OUT_*, TEMP_OUT_* and GYRO_*OUT_* registers. In fact I tried reading the data one register at a time, but I got frequent data corruption.
Then, just to try, I requested 6 bytes from ACCEL_XOUT_H, 6 bytes from GYRO_XOUT_H and 2 bytes from TEMP_OUT_H and... it worked! No more data corruption!
I think they simply forgot to mention this in the register map.
How to
Here is some example code that can work in the Arduino environment.
These are the functions I use. They are not very safe, but they work for my project:
////////////////////////////////////////////////////////////////
inline void requestBytes(byte SUB, byte nVals)
{
    Wire.beginTransmission(SAD);
    Wire.write(SUB);
    Wire.endTransmission(false); // repeated start: the bus never goes idle
    Wire.requestFrom(SAD, nVals);
    while (Wire.available() == 0); // busy-wait for the first byte
}
////////////////////////////////////////////////////////////////
inline byte getByte(void)
{
    return Wire.read();
}
////////////////////////////////////////////////////////////////
inline void stopRead(void)
{
    Wire.endTransmission(true); // release the bus with a stop condition
}
////////////////////////////////////////////////////////////////
byte readByte(byte SUB)
{
    requestBytes(SUB, 1);
    byte result = getByte();
    stopRead();
    return result;
}
////////////////////////////////////////////////////////////////
void readBytes(byte SUB, byte* buff, byte count)
{
    requestBytes(SUB, count);
    for (int i = 0; i < count; i++)
        buff[i] = getByte();
    stopRead();
}
At this point, you can simply read the values in this way:
// ACCEL_XOUT_H
// burst read the registers using auto-increment:
byte data[6];
readBytes(ACCEL_XOUT_H, data, 6);
// convert the data:
acc_x = (data[0] << 8) | data[1];
// ...
Warning!
Looks like this cannot be done for other registers. For example, to read the FIFO_COUNT_* I have to do this (otherwise I get incorrect results):
uint16_t FIFO_size(void)
{
    byte bytes[2];
    // this does not work:
    //readBytes(FIFO_COUNT_H, bytes, 2);
    bytes[0] = readByte(FIFO_COUNT_H);
    bytes[1] = readByte(FIFO_COUNT_L);
    return unisci_bytes(bytes[0], bytes[1]); // combines high and low byte
}
Reading the FIFO
Looks like the FIFO works differently: you can burst read by simply requesting multiple bytes from the FIFO_R_W register, and the MPU6050 will give you the bytes in the FIFO without incrementing the register.
I found this example where they use I2Cdev::readByte(SAD, FIFO_R_W, buffer) to read a given number of bytes from the FIFO, and if you look at I2Cdev::readByte() (here), it simply requests N bytes from the FIFO register:
// ... send FIFO_R_W and request N bytes ...
for(...; ...; count++)
data[count] = Wire.receive();
// ...
How to
This is simple since the FIFO_R_W does not auto-increment:
byte data[12];

void loop() {
    // ...
    readBytes(FIFO_R_W, data, 12); // <- replace 12 with your burst size
    // ...
}
Warning!
Using FIFO_size() is very slow!
Also, my advice is to use a 400 kHz I2C frequency, which is the MPU6050's maximum speed.
Hope it helps ;)
As Luca says, the burst read semantics seem to differ depending on the register at which the read operation starts.
Reading consistent samples
To read a consistent set of raw data values, you can use the method I2C.readRegister(int, ByteBuffer, int) with register number 59 (ACCEL_XOUT[15:8]) and a length of 14 to read all the sensor data ACCEL, TEMP, and GYRO in one operation and get consistent data.
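A minimal sketch in terms of Luca's readBytes() helper above, assuming the usual MPU-6050 register constant and int16_t combining:
byte raw[14];
readBytes(0x3B, raw, 14); // 0x3B = ACCEL_XOUT_H; one burst covers ACCEL, TEMP and GYRO

int16_t acc_x  = (raw[0]  << 8) | raw[1];
int16_t acc_y  = (raw[2]  << 8) | raw[3];
int16_t acc_z  = (raw[4]  << 8) | raw[5];
int16_t temp   = (raw[6]  << 8) | raw[7];
int16_t gyro_x = (raw[8]  << 8) | raw[9];
int16_t gyro_y = (raw[10] << 8) | raw[11];
int16_t gyro_z = (raw[12] << 8) | raw[13];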
Burst read of FIFO data
However, if you use the FIFO buffer of the chip, you can start the burst read with the same method signature on register 116 (FIFO_R_W) to read the given amount of data from the chip-internal FIFO buffer. When doing so, keep in mind that there is a limit on the number of bytes that can be read in one burst operation. If I'm interpreting https://github.com/joan2937/pigpio/blob/c33738a320a3e28824af7807edafda440952c05d/pigpio.c#L3914 right, a maximum of 31 bytes can be read in a single burst operation.

How to minimize latency when reading audio with ALSA?

When trying to acquire some signals in the frequency domain, I've encountered the issue that snd_pcm_readi() takes a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
I find that most of the time snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds, but every fourth or fifth call takes approximately 0.028 seconds. This is a huge difference and causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried to experiment with the period size, but it is unclear to me what exactly it does, even after re-reading the documentation multiple times. I don't use an interrupt-driven design; I simply call snd_pcm_readi() and it blocks until it returns with data.
I can only assume that the reason it blocks for a variable amount of time is that snd_pcm_readi() pulls data from the hardware buffer, which sometimes already has data readily available for transfer to the "application buffer" (which I'm maintaining), while at other times there is additional work to do in kernel space or on the hardware side, hence the call takes longer to return.
What purpose does the "period size" serve when I'm not using an interrupt driven design? Can my problem be fixed at all by manipulation of the period size, or should I do something else?
I want each call to snd_pcm_readi() to take approximately the same amount of time. I'm not asking for a real-time compliant API, which I don't imagine ALSA even attempts to be; however, a difference in call time on the order of 500x (which is what I'm seeing!) is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.
Typically when reading and writing audio, the period size specifies how much data ALSA reserves in the DMA hardware, and it normally determines your latency. For example, while you are filling one buffer for writing through DMA to the I2S hardware, another DMA buffer is already being written out.
If your period size is too small, the CPU doesn't have time to write the audio out in the scheduled execution slot it is given. Typically people aim for a minimum of 500 us or 1 ms of latency. If you are doing heavy computation, you may want to choose 5 ms or 10 ms of latency, or even more on a less powerful embedded system.
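A minimal sketch, assuming an already-open capture handle pcm, an allocated hw_params object, and a 48 kHz rate; the period size is negotiated through ALSA's hw_params API:
// Request roughly 10 ms periods on the capture handle.
snd_pcm_uframes_t period = 480; // 480 frames at 48 kHz is about 10 ms
int dir = 0;
snd_pcm_hw_params_set_period_size_near(pcm, hw_params, &period, &dir);
// `period` now holds the value the driver actually granted, which may differ.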
If you want to push the limit of the system, then you can request the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One method for increasing priority, taken from the gtkIOStream ALSA C++ OO classes, is like so (from the changeThreadPriority method):
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority){
    int ret;
    pthread_t thisThread = pthread_self(); // get the current thread
    struct sched_param origParams, params;
    int origPolicy, policy = SCHED_FIFO, newPolicy = 0;

    if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);

    if (priority < 0) // maximum priority
        params.sched_priority = sched_get_priority_max(policy);
    else
        params.sched_priority = priority;

    if (params.sched_priority > sched_get_priority_max(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
    if (params.sched_priority < sched_get_priority_min(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");
    if ((ret = pthread_setschedparam(thisThread, policy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");
    if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    if (policy != newPolicy)
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");

    printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
    return 0;
}
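A minimal usage sketch: call it once at the start of the capture thread. Raising a thread to SCHED_FIFO generally requires root, CAP_SYS_NICE, or a suitable RLIMIT_RTPRIO setting.
// Ask for the maximum SCHED_FIFO priority before entering the read loop.
if (changeThreadPriority(-1) != 0)
    fprintf(stderr, "could not raise audio thread priority\n");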

Weird Winsock recv() slowdown

I'm writing a little VOIP app like Skype, which works quite well so far, but I've run into a very strange problem.
In one thread, inside a while(true) loop, I call the winsock recv() function twice per iteration to get data from a socket.
The first call gets 2 bytes, which are cast to a short, while the second call gets the rest of the message, which looks like:
Complete Message: [2 Byte Header | Message, length determined by the 2Byte Header]
These packets arrive at roughly 49 per second, which works out to roughly 3000 bytes/sec.
The content of these packets is audio data that gets converted into wave.
With ioctlsocket() I determine whether there is more data on the socket after each "message" I receive (2 bytes + data). If there is something on the socket right after I received a message within the while(true) loop of the thread, the message is received but thrown away, to counteract accumulating latency.
This concept works very well, but here's the problem:
While my VOIP program is running and I am downloading a file in parallel (e.g. via a browser), too much data piles up on the socket, because the recv() loop actually seems to slow down while downloading. This happens in every download/upload situation besides the actual VOIP up/download.
I don't know where this behaviour comes from, but when I cancel every up/download besides the VOIP traffic of my application, my app works perfectly again.
When the program runs perfectly, the ioctlsocket() call writes 0 into the bytesLeft variable, which is defined in the class the receive function belongs to.
Does somebody know where this comes from? I'll attach my receive function down below:
std::string D_SOCKETS::receive_message(){
    recv(ClientSocket, (char*)&val, sizeof(val), MSG_WAITALL);
    receivedBytes = recv(ClientSocket, buffer, val, MSG_WAITALL);

    if (receivedBytes != val){
        printf("SHORT: %d PAKET: %d ERROR: %d", val, receivedBytes, WSAGetLastError());
        exit(128);
    }

    ioctlsocket(ClientSocket, FIONREAD, &bytesLeft);
    cout << "Bytes left on the Socket:" << bytesLeft << endl;

    if (bytesLeft > 20)
    {
        // message gets received, but is thrown away to counteract latency
        return std::string();
    }
    else
        return std::string(buffer, receivedBytes);
}
There is no need to use ioctlsocket() to discard data; that would indicate a bug in your protocol design. Assuming you are using TCP (you did not say), there should not be any leftover data if your 2-byte header is always accurate. After reading the 2-byte header and then reading the specified number of bytes, the next bytes you receive constitute your next message, and they should not be discarded simply because they exist.
The fact that ioctlsocket() reports more bytes available means that you are receiving messages faster than you are reading them from the socket. Make your reading code run faster; don't throw away good data because of your own slowness.
Your reading model is not efficient. Instead of reading 2 bytes, then X bytes, then 2 bytes, and so on, use a larger buffer to read more raw data from the socket at one time (use ioctlsocket() to learn how many bytes are available, then read at least that many bytes at once and append them to the end of your buffer), and then parse as many complete messages as are in the buffer before reading more raw data from the socket again. The more data you can read at a time, the faster you receive data.
To help speed up the code even more, don't process the messages directly inside the loop, either. Do the processing in another thread instead: have the reading loop put complete messages in a queue and go back to reading, and have a processing thread pull from the queue whenever messages are available for processing.
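A minimal sketch of that buffered-reading model, reusing the question's names where possible; handle_message() is a hypothetical consumer (e.g. a queue push), and byte order is assumed to match the peer's, as in the question:
#include <vector>
#include <cstring> // memcpy

std::vector<char> acc; // accumulates raw bytes across recv() calls
char chunk[4096];

for (;;)
{
    int n = recv(ClientSocket, chunk, sizeof(chunk), 0);
    if (n <= 0)
        break; // connection closed or error
    acc.insert(acc.end(), chunk, chunk + n);

    // Parse as many complete [2-byte length | payload] messages as possible.
    size_t pos = 0;
    while (acc.size() - pos >= 2)
    {
        unsigned short len;
        memcpy(&len, &acc[pos], 2);
        if (acc.size() - pos < 2u + len)
            break; // payload not fully received yet
        handle_message(&acc[pos + 2], len); // hypothetical: hand off to the queue
        pos += 2u + len;
    }
    acc.erase(acc.begin(), acc.begin() + pos);
}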

How to get NSOutputStream to send or flush packets immediately

I am having an issue with latency when connecting to a bluetooth accessory using the External Accessory Framework. When sending data, I use the following write loop, which produces the custom console output shown further below:
if( [stream hasSpaceAvailable] )
{
    NSLog( @"Space avail" );
}
else {
    NSLog( @"No space" );
}

while( [stream hasSpaceAvailable] && ( [_outputBuffer length] > 0 ) )
{
    /* write as many bytes as possible */
    NSInteger written = [stream write:[_outputBuffer bytes] maxLength:[_outputBuffer length]];
    NSLog( @"wrote %i out of %i bytes to the stream", written, [_outputBuffer length] );

    if( written == -1 )
    {
        /* error, bad */
        Log( @"Error writing bytes" );
        break;
    }
    else if( written > 0 )
    {
        /* remove the bytes from the buffer that were written */
        Log( @"erasing %i bytes", written );
        [_outputBuffer replaceBytesInRange:NSMakeRange( 0, written ) withBytes:nil length:0 ];
    }
}
This results in the following output, where "immediate pack buffer" is the payload.
immediate pack buffer-> 040040008
Space avail
wrote 10 out of 10 bytes to the stream
immediate pack buffer-> 040010005
No space
immediate pack buffer-> 030040007
No space
wrote 20 out of 20 bytes to the stream
immediate pack buffer-> 030010004
No space
immediate pack buffer-> 040000004
Space avail
wrote 20 out of 20 bytes to the stream
immediate pack buffer-> 030000003
Space avail
wrote 10 out of 10 bytes to the stream
immediate pack buffer-> 040040008
Space avail
wrote 10 out of 10 bytes to the stream
Notice how it repeatedly logs "No space", which means that hasSpaceAvailable is returning false and the data is being buffered until it returns true.
1) What I need to know is: why is this happening? Is it waiting for an ACK from the BT hardware? If so, how do you remove this blocking?
2) How do you make it send immediately, so we basically stream the data in real time without buffering?
3) Is there a hidden API method that will disable this blocking?
This is a real problem because there cannot be any delay/latency in sending the data to the device; it must be sent immediately in order for the hardware to stay in sync with the iPhone commands. Please help.
What you're asking for is impossible with most hardware (which will finish sending the current packet before starting the next one), and impossible with the usual "stream" paradigm (which requires that data is received in order, so is bandwidth-limited).
It is also physically impossible to have zero latency unless the source and destination are coincident.
The actual problem seems to be that the underlying stream only queues one packet at a time, even if the packet is only 10 bytes long. I don't know why; possibly because it's intended as a very simple protocol.
The usual way of dealing with such a queue is to register for the appropriate delegate callbacks and send as much data as you can when the stream has space available, instead of waiting for the next time you attempt to send data (which appears to be what you're doing).
The problem is that the HandleEvent delegate function is called asynchronously, so it is not hit for every write.
What you can do is collect the commands in an array up front, open the session, and call the writeData function. Once writeData has been called, the HandleEvent function does not need to be hit for every command.
Keep a count, incremented in the writeData function, of the array items; until count == arrayItems, the delegate is not hit.
This way all the commands in the list are sent one by one.
I am facing the same issue, but in a different scenario.
Scenario: the iPhone app is able to communicate with the PED when it connects for the first time. But when the PED battery dies, or it is switched off and then on again, the app is not able to communicate with the PED in spite of an active session and a valid output stream. The output stream says it does not have space to write anything.
Solution: when the PED gets switched on, the app gets notified, and at that moment I make the app kill the EASession and create it again when the PED gets a connection. Not sure whether this is the best solution; please suggest another solution if there is one.

unix sockets: how to send really big data with one "send" call?

I'm using a unix socket for data transfer (SOCK_STREAM mode).
I need to send a string of more than 100k chars. First, I send the length of the string, which is sizeof(int) bytes:
length = strlen(s)
send(sd, length, sizeof(int))
Then I send the whole string:
bytesSend = send(sd, s, length)
but to my surprise, bytesSend is less than length.
Note that this works fine when I send smaller strings.
Maybe there are some limitations of the send system call that I've been missing...
The send system call is supposed to be fast, because the program may have other useful things to do. Certainly you do not want to wait for the data to be sent out and the other computer to send a reply; that would lead to terrible throughput.
So all send really does is queue some data for sending and return control to the program. The kernel could copy the entire message into kernel memory, but this would consume a lot of kernel memory (not good).
Instead, the kernel only queues as much of the message as is reasonable. It is the program's responsibility to re-attempt sending of the remaining data.
In your case, use a loop to send the data that did not get sent the first time:
while (length > 0) {
    bytesSent = send(sd, s, length, 0);
    if (bytesSent == 0)
        break; // socket probably closed
    else if (bytesSent < 0)
        break; // handle errors appropriately
    s += bytesSent;
    length -= bytesSent;
}
At the receiving end you will likely need to do the same thing.
Your initial send() call is wrong. You need to pass send() the address of the data, i.e.:
bytesSend = send(sd, &length, sizeof(int), 0)
Also, this runs into some classical risks: endianness, the size of int on various platforms, et cetera.
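For example, a minimal sketch of a more portable length prefix, assuming illustrative variable names (htonl()/ntohl() are the standard byte-order helpers):
#include <arpa/inet.h> // htonl
#include <stdint.h>
#include <string.h>

uint32_t length = (uint32_t)strlen(s);
uint32_t wire = htonl(length);    // fixed width, big-endian on the wire
send(sd, &wire, sizeof(wire), 0); // the receiver decodes with ntohl()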