MongoDB batch read implementation issue with change stream replica set

Issue:
An inference-generating process writes around 300 inference documents per second to a MongoDB collection. Another process uses MongoDB's change stream feature to read these inferences back and do the post-processing. Currently, only a single inference document is returned each time the change stream API (mongoc_change_stream_next()) is called, so a total of 300 such calls is required to get all the inference data stored within one second. However, each read is followed by around 50 ms of post-processing for the single/multiple inference documents. Because of the single-document return model, draining one second's worth of data takes roughly 15 seconds (300 x 50 ms), an effective latency of 15x. To tackle this issue, we are trying to implement a batch read mechanism on top of MongoDB's change stream feature. We tried various options to implement this, but each change stream API call still returns only one document. Is there any way to sort out this issue?
Platform:
OS: Ubuntu 16.04
Mongo-c-driver: 1.15.1
Mongo server : 4.0.12
Options tried out:
Setting the batch size of the cursor to more than 1.
int main (void)
{
    const char *uri_string = "mongodb://localhost:27017/?replicaSet=set0";
    mongoc_change_stream_t *stream;
    mongoc_collection_t *coll;
    bson_error_t error;
    mongoc_uri_t *uri;
    mongoc_client_t *client;
    const bson_t *doc;
    const bson_t *err_doc;
    bson_t empty = BSON_INITIALIZER;   /* empty aggregation pipeline */
    bson_t *opts;

    mongoc_init ();

    /*
     * Do a blocking read on the change stream and call the inference
     * parse function with the JSON.
     */
    uri = mongoc_uri_new_with_error (uri_string, &error);
    if (!uri) {
        fprintf (stderr,
                 "failed to parse URI: %s\n"
                 "error message: %s\n",
                 uri_string,
                 error.message);
        return -1;
    }
    client = mongoc_client_new_from_uri (uri);
    if (!client) {
        return -1;
    }
    coll = mongoc_client_get_collection (client, <DB-NAME>, <collection-name>);
    /* mongoc_change_stream_t is opaque, so the batch size is passed via the
     * watch options rather than set on the underlying cursor */
    opts = BCON_NEW ("batchSize", BCON_INT32 (20));
    stream = mongoc_collection_watch (coll, &empty, opts);
    while (1) {
        while (mongoc_change_stream_next (stream, &doc)) {
            char *as_json = bson_as_relaxed_extended_json (doc, NULL);
            ............
            // post-processing consuming 50 ms of time
            ............
            bson_free (as_json);
        }
        if (mongoc_change_stream_error_document (stream, &error, &err_doc)) {
            if (!bson_empty (err_doc)) {
                fprintf (stderr,
                         "Server Error: %s\n",
                         bson_as_relaxed_extended_json (err_doc, NULL));
            } else {
                fprintf (stderr, "Client Error: %s\n", error.message);
            }
            break;
        }
    }
    return 0;
}

Currently, only a single inference data is returned when the change stream function API (mongoc_change_stream_next()) is called
Technically it's not that only a single document is returned: mongoc_change_stream_next() iterates the underlying cursor, setting the bson to the next document each time. So even when the batch returned by the server holds more than one document, you still have to iterate once per document.
You could try:
Create separate threads to process the documents in parallel, so you don't have to wait 50 ms per document or 15 seconds cumulatively.
Loop through a batch of documents, e.g. cache 50 of them, then perform the post-processing in one pass (a sketch follows below).
Batch-process them on separate threads (a combination of the two above).
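For the second option, here is a minimal sketch against the mongo-c-driver API. process_batch() is a hypothetical stand-in for your 50 ms post-processing step, not a driver function:

#define BATCH_MAX 50

const bson_t *doc;
char *cached[BATCH_MAX];
size_t n = 0;

while (mongoc_change_stream_next (stream, &doc)) {
    /* cache the JSON instead of post-processing each document immediately */
    cached[n++] = bson_as_relaxed_extended_json (doc, NULL);
    if (n == BATCH_MAX) {
        process_batch (cached, n);   /* one post-processing pass per batch */
        while (n) bson_free (cached[--n]);
    }
}
/* flush the remainder once no notification is immediately available */
if (n) {
    process_batch (cached, n);
    while (n) bson_free (cached[--n]);
}

This does not change how many times mongoc_change_stream_next() is called, but it lets the 50 ms post-processing cost be paid once per batch instead of once per document.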

Related

Socket read often returns -1 while the buffer is not empty

I am trying to test WiFi data transfer between a cell phone and an ESP32 (Arduino). When the ESP32 reads file data via WiFi, client.read() often returns -1 even though there is still data to read, so I have to add other conditions to check whether reading is finished or not.
My question is why there are so many failed reads. Any ideas are highly appreciated.
void setup()
{
    i = 0;
    Serial.begin(115200);
    Serial.println("begin...");
    // You can remove the password parameter if you want the AP to be open.
    WiFi.softAP(ssid, password);
    IPAddress myIP = WiFi.softAPIP();
    Serial.print("AP IP address: ");
    Serial.println(myIP);
    server.begin();
    Serial.println("Server started");
}

// the loop function runs over and over again until power down or reset
void loop()
{
    WiFiClient client = server.available();   // listen for incoming clients
    if (client)                               // if you get a client,
    {
        Serial.println("New Client.");        // print a message out the serial port
        Serial.println(client.remoteIP().toString());
        while (client.connected())            // loop while the client's connected
        {
            while (client.available() > 0)    // if there are bytes to read from the client,
            {
                char c = client.read();       // read a byte, then
                if (DOWNLOADFILE == c) {
                    pretime = millis();
                    uint8_t filename[32] = {0};
                    uint8_t bFilesize[8];
                    long filesize;
                    int segment = 0;
                    int remainder = 0;
                    uint8_t data[512];
                    int len = 0;
                    int totallen = 0;
                    delay(50);
                    len = client.read(filename, 32);
                    delay(50);
                    len = client.read(bFilesize, 8);
                    filesize = BytesToLong(bFilesize);
                    segment = (int)filesize / 512;
                    delay(50);
                    i = 0;   // succeed times
                    j = 0;   // fail times
                    ////////////////////////////////////////////////////////////////////
                    // Problem occurs here: too many "-1" return values.
                    // Total read 24941639 bytes, succeeded 49725 times, failed 278348 times.
                    // If there were no read problems, it should only read 48,715 times and finish,
                    // but it read 328,073 times in total, including 278,348 failed times, wasting too much time.
                    while (((len = client.read(data, 512)) != -1) || (totallen < filesize))
                    {
                        if (len > -1) {
                            totallen += len;
                            i++;
                        }
                        else {
                            j++;
                        }
                    }
                    /// loop read end, too many failed reads ///////////////////////////
                    sprintf(toClient, "\nfile name %s, size %ld, total read %d, segment %d, succeeded %d times, failed %d times\n",
                            filename, filesize, totallen, segment, i, j);
                    Serial.write(toClient);
                    curtime = millis();
                    sprintf(toClient, "time elapsed %lu ms, speed %lu Bps\n",
                            curtime - pretime, filesize * 1000 / (curtime - pretime));
                    Serial.write(toClient);
                    client.write(RETSUCCESS);
                }
                else
                {
                    Serial.write("Unknown command\n");
                }
            }
        }
        // close the connection:
        client.stop();
        Serial.println("Client Disconnected.");
    }
}
When you call available() and check for > 0, you are checking whether one or more characters are available to read. It will be true if just one character has arrived. You read one character, which is fine, but then you start reading more without stopping to check whether more are available.
TCP doesn't guarantee that if you write 100 characters to a socket that they all arrive at once. They can arrive in arbitrary "chunks" with arbitrary delays. All that's guaranteed is that they will eventually arrive in order (or if that's not possible because of networking issues, the connection will fail.)
In the absence of a blocking read function (I don't know if those exist) you have to do something like what you are doing: read one character at a time, append it to a buffer, and gracefully handle the possibility of getting a -1 (the next character isn't here yet, or the connection broke). In general you never want to read multiple characters in a single read(buf, len) unless you've just used available() to make sure len characters are actually available. And even that can fail if your buffers are really large. Stick to one character at a time (see the sketch at the end of this answer).
It's a reasonable idea to call delay(1) when available() returns 0. In the places where you try to guess at something like delay(20) before reading a buffer you are rolling the dice - there's no promise that any amount of delay will guarantee bytes get delivered. Example: Maybe a drop of water fell on the chip's antenna and it won't work until the drop evaporates. Data could be delayed for minutes.
I don't know how available() behaves if the connection fails. You might have to do a read() and get back a -1 to diagnose a failed connection. The Arduino documentation is absolutely horrible, so you'll have to experiment.
TCP is much simpler to handle on platforms that have threads, blocking read, select() and other tools to manage data. Having only non-blocking read makes things harder, but there it is.
In some situations UDP is actually a lot simpler - there are more guarantees about getting messages of certain sizes in a single chunk. But of course whole messages can go missing or show up out of order. It's a trade-off.
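As a sketch of the one-character-at-a-time approach described above: readFully() is a hypothetical helper (not part of the WiFi library), and the timeout value is an arbitrary assumption:

// Read exactly 'want' bytes, one at a time, tolerating -1 returns.
bool readFully(WiFiClient &client, uint8_t *buf, size_t want, unsigned long timeoutMs)
{
    size_t got = 0;
    unsigned long start = millis();
    while (got < want)
    {
        if (!client.connected())
            return false;                      // connection broke
        if (client.available() > 0)
        {
            int c = client.read();             // one byte at a time
            if (c >= 0)
            {
                buf[got++] = (uint8_t)c;
                continue;
            }
        }
        if (millis() - start > timeoutMs)
            return false;                      // data never arrived in time
        delay(1);                              // yield while waiting
    }
    return true;
}

With a helper like this, the fixed delay(50) calls before reading the filename and file size become unnecessary, and the -1 returns disappear because read() is only called when available() reports pending data.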

How to load records from MongoDB with limit using Spring Data

I want to load only 100000 records which are in NOT_STARTED status in MongoDB, process those records, and update their status to STARTED. I want to repeat this process until all the records in NOT_STARTED status are processed.
Currently I am using PageRequest as shown in the code below and it seems to be working. But is there a way to do this without PageRequest, with my repository extending Spring's MongoRepository? PageRequest seems to be meant for pagination, but I am not doing any pagination, only loading 100000 records each time and processing them.
Sort sort = new Sort(Sort.Direction.ASC, "_id");
int count = (int) PaymentReportRepository.count();
for (int i = 0; i < count; i += reportProperties.getPageSize()) {
    List<PaymentReport> paymentReportList =
        MongoTraceability.capture(() ->
            PaymentReportRepository.findByStatusAndDateLessThan("NOT_STARTED",
                LocalDateTime.now().minusSeconds(reportProperties.getTimeInterval()),
                PageRequest.of(0, reportProperties.getPageSize(), sort)));
    if (paymentReportList != null && !paymentReportList.isEmpty()) {
        for (PaymentReport paymentReport : paymentReportList) {
            messageService.processMessage(paymentReport);
        }
    }
}
It appears that you're processing each record synchronously. Do you have any desire/ability to process asynchronously?
Will this solution be run off a single JVM?
From your question I'm assuming synchronous processing and a single JVM.
I would use Spring's MongoTemplate class. Example tutorials/examples here: https://www.baeldung.com/queries-in-spring-data-mongodb
MongoTemplate will allow you to write your query along the lines of query("NOT_STARTED").limit(100000) to return the results you want. Assuming your messageService.processMessage(paymentReport); is doing an update() to the document after it is done processing and updates its status, then your next query will retrieve the next 100000 messages with your desired status.
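For illustration, a rough sketch of that approach with MongoTemplate, assuming the PaymentReport document has status and date fields (the field names are inferred from your repository method, so treat them as assumptions):

import java.time.LocalDateTime;
import java.util.List;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// build a query for the next 100000 unprocessed records
Query query = new Query(Criteria.where("status").is("NOT_STARTED")
        .and("date").lt(LocalDateTime.now().minusSeconds(reportProperties.getTimeInterval())))
        .with(new Sort(Sort.Direction.ASC, "_id"))
        .limit(100000);

List<PaymentReport> batch = mongoTemplate.find(query, PaymentReport.class);
for (PaymentReport paymentReport : batch) {
    // processMessage() updates the status to STARTED, so the next find() skips it
    messageService.processMessage(paymentReport);
}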
You can try to rename findByStatusAndDateLessThan to findFirst100000ByStatusAndDateLessThan
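For reference, a derived query with that limit keyword would look something like this (the String id type is an assumption):

public interface PaymentReportRepository extends MongoRepository<PaymentReport, String> {
    // "First100000" caps the result size without passing a PageRequest
    List<PaymentReport> findFirst100000ByStatusAndDateLessThan(
            String status, LocalDateTime date, Sort sort);
}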

Read all available bytes from TCP Socket (unknown byte count)

I am having problems using the Indy TIdTCPClient.
I want to call a function every time there is data available on the socket. For this I have a thread calling IdTCPClient->Socket->Readable(100).
The function itself looks like this:
TMemoryStream *mStream = new TMemoryStream;
int len = 0;
try
{
    if (!Form1->IdTCPClient2->Connected())
        Form1->IdTCPClient2->Connect();
    mStream->Position = 0;
    do
    {
        Form1->IdTCPClient2->Socket->ReadStream(mStream, 1);
    }
    while (Form1->IdTCPClient2->Socket->Readable(100));
    len = mStream->Position;
    mStream->Position = 0;
    mStream->Read(Buffer, len);
}
catch (Exception &Ex) {
    Form1->DisplaySSH->Lines->Add(Ex.Message);
    Form1->DisplaySSH->GoToTextEnd();
}
delete mStream;
It will not be called directly within the thread; the thread triggers an event, which calls this function. That means I am using Readable(100) twice without reading data in between.
Since I don't know how many bytes I have to read, I thought I could read one byte, check if there is more available, and then read another byte.
The problem here is that the do-while loop doesn't loop; it just runs once.
I am guessing that Readable() does not quite work the way I need it to.
Is there any other way to receive all the bytes available on the socket?
You should not be using Readable() directly in this situation. That call reports whether the underlying socket has pending unread data in its internal kernel buffer. That does not take into account that the TIdIOHandler may already have unread data in its InputBuffer that is left over from a previous read operation.
Use the TIdIOHandler::CheckForDataOnSource() method instead of TIdIOHandler::Readable():
TMemoryStream *mStream = new TMemoryStream;
try
{
    if (!Form1->IdTCPClient2->Connected())
        Form1->IdTCPClient2->Connect();
    mStream->Position = 0;
    do
    {
        if (Form1->IdTCPClient2->IOHandler->InputBufferIsEmpty())
        {
            if (!Form1->IdTCPClient2->IOHandler->CheckForDataOnSource(100))
                break;
        }
        Form1->IdTCPClient2->IOHandler->ReadStream(mStream, Form1->IdTCPClient2->IOHandler->InputBuffer->Size, false);
        /* alternatively:
        Form1->IdTCPClient2->IOHandler->InputBuffer->ExtractToStream(mStream);
        */
    }
    while (true);
    // use mStream as needed...
}
catch (const Exception &Ex) {
    Form1->DisplaySSH->Lines->Add(Ex.Message);
    Form1->DisplaySSH->GoToTextEnd();
}
delete mStream;
Or, you can alternatively use TIdIOHandler::ReadBytes() instead of TIdIOHandler::ReadStream(). If you set its AByteCount parameter to -1, it will return only the bytes that are currently available (if the InputBuffer is empty, ReadBytes() will wait up to the ReadTimeout interval for the socket to receive any new bytes)[1]:
try
{
    if (!Form1->IdTCPClient2->Connected())
        Form1->IdTCPClient2->Connect();
    TIdBytes data;
    do
    {
        if (Form1->IdTCPClient2->IOHandler->InputBufferIsEmpty())
        {
            if (!Form1->IdTCPClient2->IOHandler->CheckForDataOnSource(100))
                break;
        }
        Form1->IdTCPClient2->IOHandler->ReadBytes(data, -1, true);
        /* alternatively:
        Form1->IdTCPClient2->IOHandler->InputBuffer->ExtractToBytes(data, -1, true);
        */
    }
    while (true);
    // use data as needed...
}
catch (const Exception &Ex) {
    Form1->DisplaySSH->Lines->Add(Ex.Message);
    Form1->DisplaySSH->GoToTextEnd();
}
[1] Make sure you are using an up-to-date snapshot of Indy 10. Prior to Oct 6, 2016, there was a logic bug in ReadBytes() when AByteCount=-1 that didn't take the InputBuffer into account before checking the socket for new bytes.

Bulk operations in Mongoskin [duplicate]

I'm having trouble using Mongoskin to perform bulk inserts (MongoDB 2.6+) on Node.
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe: true});
var bulk = db.collection('collection').initializeUnorderedBulkOp();

for (var i = 0; i < 200000; i++) {
    bulk.insert({number: i}, function() {
        console.log('bulk inserting: ', i);
    });
}

bulk.execute(function(err, result) {
    res.json('send response statement');
});
The above code gives the following warnings/errors:
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
Is it possible to use Mongoskin to perform unordered bulk operations? If so, what am I doing wrong?
You can do it, but you need to change your calling convention: only the "callback" form of .collection() actually returns a collection object from which the .initializeUnorderedBulkOp() method can be called. There are also some differences in usage from how you think this works:
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe: true});

db.collection('collection', function(err, collection) {
    var bulk = collection.initializeUnorderedBulkOp();
    var count = 0;

    for (var i = 0; i < 200000; i++) {
        bulk.insert({number: i});
        count++;

        if (count % 1000 == 0)
            bulk.execute(function(err, result) {
                // maybe do something with results
                bulk = collection.initializeUnorderedBulkOp(); // reset after execute
            });
    }

    // If your loop count was not a round multiple of 1000
    if (count % 1000 != 0)
        bulk.execute(function(err, result) {
            // maybe do something here
        });
});
So the actual "Bulk" methods themselves don't require callbacks and work exactly as shown in the documentation. The exception is .execute(), which actually sends the statements to the server.
While the driver will sort this out for you somewhat, it probably is not a great idea to queue up too many operations before calling execute. This basically builds up in memory, and though the driver will only send in batches of 1000 at a time (this is a server limit, as is the complete batch being under 16MB), you probably want a little more control here, at least to limit memory usage.
That is the point of the modulo tests as shown, but if memory for building the operations and a possibly really large response object are not a problem for you then you can just keep queuing up operations and call .execute() once.
The "response" is in the same format as given in the documentation for BulkWriteResult.

Data is getting discarded in TCP/IP with boost::asio::read_some?

I have implemented a TCP server using boost::asio. This server uses basic_stream_socket::read_some function to read data. I know that read_some does not guarantee that supplied buffer will be full before it returns.
In my project I am sending strings separated by a delimiter (if that matters). On the client side I am using the WinSock send() function to send data. Now my problem is that on the server side I am not able to get all the strings which were sent from the client side. My suspicion is that read_some receives some data and for some reason discards the leftover; then on the next call it receives another string.
Is that really possible in TCP/IP?
I tried to use async_receive, but that eats up all my CPU, and since the buffer has to be cleaned up by the callback function it causes a serious memory leak in my program. (I am using IoService::poll() to call the handler. That handler is called at a very slow rate compared to the calling rate of async_read().)
Again, I tried the free function read, but that does not serve my purpose as it blocks for too long with the buffer size I am supplying.
My previous implementation of the server was with WinSock API where I was able to receive all data using WinSock::recv().
Please give me some leads so that I can receive complete data using boost::asio.
here is my server side thread loop
void
TCPObject::receive()
{
    if (!_asyncModeEnabled)
    {
        std::string recvString;
        if (!_tcpSocket->receiveData(_maxBufferSize, recvString))
        {
            LOG_ERROR("Error Occurred while receiving data on socket.");
        }
        else
            _parseAndPopulateQueue(recvString);
    }
    else
    {
        if (!_tcpSocket->receiveDataAsync(_maxBufferSize))
        {
            LOG_ERROR("Error Occurred while receiving data on socket.");
        }
    }
}
receiveData() in TCPSocket
bool
TCPSocket::receiveData(unsigned int bufferSize, std::string& dataString)
{
    boost::system::error_code error;
    char *buf = new char[bufferSize + 1];
    size_t len = _tcpSocket->read_some(boost::asio::buffer((void*)buf, bufferSize), error);
    if (error)
    {
        LOG_ERROR("Error in receiving data.");
        LOG_ERROR(error.message());
        _tcpSocket->close();
        delete [] buf;
        return false;
    }
    buf[len] = '\0';
    dataString.insert(0, buf);
    delete [] buf;
    return true;
}
receiveDataAsync in TCP Socket
bool
TCPSocket::receiveDataAsync(unsigned int bufferSize)
{
    char *buf = new char[bufferSize + 1];
    try
    {
        _tcpSocket->async_read_some(boost::asio::buffer((void*)buf, bufferSize),
            boost::bind(&TCPSocket::_handleAsyncReceive,
                        this,
                        buf,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
        //! Asks io_service to execute the callback
        _ioService->poll();
    }
    catch (std::exception& e)
    {
        LOG_ERROR("Error Receiving Data Asynchronously");
        LOG_ERROR(e.what());
        delete [] buf;
        return false;
    }
    // we don't delete buf here as it will be deleted by the callback _handleAsyncReceive
    return true;
}
Asynch Receive handler
void
TCPSocket::_handleAsyncReceive(char *buf, const boost::system::error_code& ec, size_t size)
{
    if (ec)
    {
        LOG_ERROR("Error occurred while receiving data asynchronously.");
        LOG_ERROR(ec.message());
    }
    else if (size > 0)
    {
        buf[size] = '\0';
        emit _asyncDataReceivedSignal(QString::fromLocal8Bit(buf));
    }
    delete [] buf;
}
Client Side sendData function.
void sendData(std::string data)
{
    if (!_connected)
    {
        return;
    }
    const char *pBuffer = data.c_str();
    int bytes = data.length() + 1;   // +1 sends the '\0' delimiter too
    int i = 0, j;
    while (i < bytes)
    {
        j = send(_connectSocket, pBuffer + i, bytes - i, 0);
        if (j == SOCKET_ERROR)
        {
            _connected = false;
            if (!_bNetworkErrNotified)
            {
                _bNetworkErrNotified = true;
                emit networkErrorSignal(j);
            }
            LOG_ERROR("Unable to send Network Packet");
            break;
        }
        i += j;
    }
}
Boost.Asio's TCP capabilities are pretty well used, so I would be hesitant to suspect it is the source of the problem. In most cases of data loss, the problem is the result of application code.
In this case, there is a problem in the receiver code. The sender is delimiting strings with \0. However, the receiver fails to properly handle the delimiter in cases where multiple strings are read in a single read operation, as string::insert() will cause truncation of the char* when it reaches the first delimiter.
For example, the sender writes two strings "Test string\0" and "Another test string\0". In TCPSocket::receiveData(), the receiver reads "Test string\0Another test string\0" into buf. dataString is then populated with dataString.insert(0, buf). This particular overload will copy up to the delimiter, so dataString will contain "Test string". To resolve this, consider using the string::insert() overload that takes the number of characters to insert: dataString.insert(0, buf, len).
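A minimal standalone illustration of the truncation and the fix:

#include <string>
#include <cassert>

int main()
{
    // two delimited strings as they might arrive in a single read_some()
    char buf[] = "Test string\0Another test string";
    size_t len = sizeof(buf);             // 32 bytes, both terminators included

    std::string truncated;
    truncated.insert(0, buf);             // stops at the first '\0'
    assert(truncated == "Test string");

    std::string complete;
    complete.insert(0, buf, len);         // copies all len bytes
    assert(complete.size() == len);       // both strings preserved
    return 0;
}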
I have not used the poll function before. What I did was create a worker thread dedicated to processing ASIO handlers with the run function, which blocks. The Boost documentation says that each thread that is to be made available to process async event handlers must first call the io_service::run or io_service::poll method. I'm not sure what else you are doing with the thread that calls poll.
So, I would suggest dedicating at least one worker thread for the async ASIO event handlers and use run instead of poll. If you want that worker thread to continue to process all async messages without returning and exiting, then add a work object to the io_service object. See this link for an example.
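A minimal sketch of that worker-thread pattern, using the same pre-Boost-1.66 io_service API as the question (the names here are illustrative):

#include <boost/asio.hpp>
#include <thread>

int main()
{
    boost::asio::io_service ioService;

    // the work object keeps run() from returning while no handlers are queued
    boost::asio::io_service::work work(ioService);

    // dedicated worker: run() blocks and dispatches async handlers
    // (e.g. _handleAsyncReceive) as their operations complete
    std::thread worker([&ioService]() { ioService.run(); });

    // ... start async_read_some() operations here; no poll() loop needed ...

    ioService.stop();   // or destroy the work object to let run() drain and exit
    worker.join();
    return 0;
}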