Bulk operations in Mongoskin [duplicate]

I'm having trouble using Mongoskin to perform bulk inserts (MongoDB 2.6+) on Node.
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});
var bulk = db.collection('collection').initializeUnorderedBulkOp();
for (var i = 0; i < 200000; i++) {
    bulk.insert({number: i}, function() {
        console.log('bulk inserting: ', i);
    });
}

bulk.execute(function(err, result) {
    res.json('send response statement');
});
The above code gives the following warnings/errors:
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
(node) warning: possible EventEmitter memory leak detected. 51 listeners added. Use emitter.setMaxListeners() to increase limit.
TypeError: Object #<SkinClass> has no method 'execute'
Is it possible to use Mongoskin to perform unordered bulk operations? If so, what am I doing wrong?

You can do it, but you need to change your calling conventions, as only the "callback" form will actually return a collection object from which the .initializeUnorderedBulkOp() method can be called. There are also some usage differences from how you might expect this to work:
var dbURI = urigoeshere;
var db = mongo.db(dbURI, {safe:true});
db.collection('collection', function(err, collection) {
    var bulk = collection.initializeUnorderedBulkOp();
    var count = 0;

    for (var i = 0; i < 200000; i++) {
        bulk.insert({number: i});
        count++;

        if ( count % 1000 == 0 ) {
            bulk.execute(function(err, result) {
                // maybe do something with results
            });
            bulk = collection.initializeUnorderedBulkOp();  // reset after execute
        }
    }

    // If your loop was not a round divisor of 1000
    if ( count % 1000 != 0 )
        bulk.execute(function(err, result) {
            // maybe do something here
        });
});
So the actual "Bulk" methods themselves don't require callbacks and work exactly as shown in the documentation. The exception is .execute(), which actually sends the statements to the server.
While the driver will sort this out for you to some extent, it is probably not a great idea to queue up too many operations before calling .execute(). The operations build up in memory, and although the driver only sends them to the server in batches of 1000 at a time (this is a server limit, and the complete batch must also be under 16MB), you probably want a little more control here, at least to limit memory usage.
That is the point of the modulo tests shown above; but if memory for building the operations and a possibly very large response object are not a problem for you, then you can just keep queuing up operations and call .execute() once.
The "response" is in the same format as given in the documentation for BulkWriteResult.

Related

MongoDB Batch read implementation issue with change stream replica set

Issue:
An inference-generating process writes around 300 inference records per second to a MongoDB collection. The change stream feature of MongoDB is used by another process to read these inferences back and do the post-processing. Currently, only a single inference record is returned each time the change stream API (mongoc_change_stream_next()) is called, so a total of 300 such calls is required to get all the inference data stored within one second. However, after each read, around 50ms is required to perform the post-processing for single/multiple inference records. Because of the single-record return model, an effective latency of 15x is introduced. To tackle this issue, we are trying to implement a batch read mechanism in line with the change stream feature of MongoDB. We tried various options to implement this, but we still get only one record after each change stream API call. Is there any way to sort out this issue?
Platform:
OS: Ubuntu 16.04
Mongo-c-driver: 1.15.1
Mongo server : 4.0.12
Options tried out:
Setting the batch size of the cursor to more than 1.
#include <mongoc/mongoc.h>
#include <stdio.h>

int main (void) {
    const char *uri_string = "mongodb://localhost:27017/?replicaSet=set0";
    mongoc_change_stream_t *stream;
    mongoc_collection_t *coll;
    bson_error_t error;
    mongoc_uri_t *uri;
    mongoc_client_t *client;
    bson_t empty = BSON_INITIALIZER;   /* empty change stream pipeline */
    const bson_t *doc;
    const bson_t *err_doc;

    mongoc_init ();
    /*
     * Add the MongoDB blocking read and call the inference parse function with the JSON
     */
    uri = mongoc_uri_new_with_error (uri_string, &error);
    if (!uri) {
        fprintf (stderr,
                 "failed to parse URI: %s\n"
                 "error message: %s\n",
                 uri_string,
                 error.message);
        return -1;
    }
    client = mongoc_client_new_from_uri (uri);
    if (!client) {
        return -1;
    }
    coll = mongoc_client_get_collection (client, <DB-NAME>, <collection-name>);
    stream = mongoc_collection_watch (coll, &empty, NULL);
    /* option tried: ask the cursor for more than one document per batch */
    mongoc_cursor_set_batch_size (stream->cursor, 20);
    while (1) {
        while (mongoc_change_stream_next (stream, &doc)) {
            char *as_json = bson_as_relaxed_extended_json (doc, NULL);
            ............
            ............
            // post-processing consuming 50 ms of time
            ............
            ............
        }
        if (mongoc_change_stream_error_document (stream, &error, &err_doc)) {
            if (!bson_empty (err_doc)) {
                fprintf (stderr,
                         "Server Error: %s\n",
                         bson_as_relaxed_extended_json (err_doc, NULL));
            } else {
                fprintf (stderr, "Client Error: %s\n", error.message);
            }
            break;
        }
    }
    return 0;
}
Currently, only a single inference data is returned when the change stream function API (mongoc_change_stream_next()) is called
Technically it's not that only a single document is returned per server round trip. mongoc_change_stream_next() iterates the underlying cursor, pointing the supplied bson at the next document on each call. So even if the batch returned by the server contains more than one document, you still have to iterate through it one document at a time.
You could try:
Creating separate threads to process the documents in parallel, so you don't have to wait 50ms per document, or 15 seconds cumulatively.
Looping through a batch of documents, e.g. caching 50 of them and then performing the post-processing as a batch (see the sketch after this list).
Batch processing them on separate threads (a combination of the two above).
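As a rough sketch of the batching option (not a definitive implementation: process_batch() is a hypothetical stand-in for your 50 ms post-processing, and the fixed batch size of 50 is arbitrary), the inner loop could copy documents into a buffer and process them together:
#define BATCH 50

const bson_t *doc;
bson_t *batch[BATCH];
size_t count = 0;

while (mongoc_change_stream_next (stream, &doc)) {
    /* doc is only valid until the next call, so take a copy for the buffer */
    batch[count++] = bson_copy (doc);
    if (count == BATCH) {
        process_batch (batch, count);   /* hypothetical batch post-processing */
        for (size_t i = 0; i < count; i++) {
            bson_destroy (batch[i]);
        }
        count = 0;
    }
}

/* flush whatever is left once the stream has no more documents ready */
if (count > 0) {
    process_batch (batch, count);
    for (size_t i = 0; i < count; i++) {
        bson_destroy (batch[i]);
    }
}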

Parallel.Foreach and BulkCopy

I have a C# library which connects to 59 servers with the same database structure and imports data into the same table in my local db. At the moment I am retrieving data server by server in a foreach loop:
foreach (var systemDto in systems)
{
    var sourceConnectionString = _systemService.GetConnectionStringAsync(systemDto.Ip).Result;
    var dbConnectionFactory = new DbConnectionFactory(sourceConnectionString,
        "System.Data.SqlClient");
    var dbContext = new DbContext(dbConnectionFactory);
    var storageRepository = new StorageRepository(dbContext);
    var usedStorage = storageRepository.GetUsedStorageForCurrentMonth();

    var dtUsedStorage = new DataTable();
    dtUsedStorage.Load(usedStorage);

    var dcIp = new DataColumn("IP", typeof(string)) {DefaultValue = systemDto.Ip};
    var dcBatchDateTime = new DataColumn("BatchDateTime", typeof(string))
    {
        DefaultValue = batchDateTime
    };
    dtUsedStorage.Columns.Add(dcIp);
    dtUsedStorage.Columns.Add(dcBatchDateTime);

    using (var blkCopy = new SqlBulkCopy(destinationConnectionString))
    {
        blkCopy.DestinationTableName = "dbo.tbl";
        blkCopy.WriteToServer(dtUsedStorage);
    }
}
Because there are many systems to retrieve data from, I wonder if it is possible to use a Parallel.ForEach loop? Also, will BulkCopy lock the table during WriteToServer, so that the next WriteToServer has to wait until the previous one completes?
-- EDIT 1
I've changed foreach to Parallel.ForEach, but I'm facing one problem. Inside this loop I have an async method: _systemService.GetConnectionStringAsync(systemDto.Ip)
and this line throws the error:
System.NotSupportedException: A second operation started on this
context before a previous asynchronous operation completed. Use
'await' to ensure that any asynchronous operations have completed
before calling another method on this context. Any instance members
are not guaranteed to be thread safe.
Any ideas how I can handle this?
In general, it will get blocked and will wait until the previous operation completes.
There are some factors that may affect whether SqlBulkCopy can be run in parallel or not.
I remember that when I added the Parallel feature to my .NET Bulk Operations library, I had a hard time making it work correctly in parallel, but it worked well when the table had no index (which is almost never the case).
Even when it worked, the performance gain was not very large.
Perhaps you will find more information here: MSDN - Importing Data in Parallel with Table Level Locking
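As a rough sketch (untested; it assumes the connection strings can be resolved sequentially up front, which also avoids the DbContext thread-safety error from EDIT 1, and BuildUsedStorageTable is a hypothetical helper wrapping the connection-factory/repository/DataTable code from the original loop body):
// Resolve connection strings first, so the shared service/context is never
// called from multiple threads at once.
var connectionStrings = new Dictionary<string, string>();
foreach (var systemDto in systems)
{
    connectionStrings[systemDto.Ip] =
        _systemService.GetConnectionStringAsync(systemDto.Ip).Result;
}

Parallel.ForEach(systems, systemDto =>
{
    // hypothetical helper: builds the same DataTable as the original loop body
    var dtUsedStorage = BuildUsedStorageTable(connectionStrings[systemDto.Ip], batchDateTime);

    // TableLock matches the MSDN article above; parallel loads with table-level
    // locking only really pay off when the target table is a heap (no indexes).
    using (var blkCopy = new SqlBulkCopy(destinationConnectionString,
        SqlBulkCopyOptions.TableLock))
    {
        blkCopy.DestinationTableName = "dbo.tbl";
        blkCopy.WriteToServer(dtUsedStorage);
    }
});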

how to use couchbase as fifo queue

With the Java client, how can I use Couchbase to implement a thread-safe FIFO queue? There can be many threads popping from the queue and pushing into it. Each object in the queue is a String[].
Couchbase doesn't have any built-in functionality for creating queues, but you can build one yourself.
I'll explain how in the short example below.
Say we have a queue named queue that holds items named item:<index>. To implement the queue, you store your values under keys like <queue_name>:item:<index>, where the index comes from a separate key, queue:index, that you increment while pushing to the queue and decrement while popping.
Couchbase's increment and decrement operations are suitable for this because they are atomic and thread safe.
So the code for your push and pop functions would look something like this:
void push(string queue, string[] value){
    int index = couchbase.increment(queue + ':index');
    couchbase.set(queue + ':item:' + index, value);
}

string[] pop(string queue){
    int index = couchbase.get(queue + ':index');
    string[] result = couchbase.get(queue + ':item:' + index);
    couchbase.decrement(queue + ':index');
    return result;
}
Sorry for the code; it's been a long time since I used Java and the Couchbase Java client. If the Java client now has callbacks, like the Node.js client, you can rewrite this code to use them; that would be better, I think.
You can also add an extra check to the set operation: use the add operation (in the C# client it's called StoreMode.Add), which throws an exception if an item with the given key already exists. You can catch that exception and call the push function again with the same arguments, as in the sketch below.
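In the same pseudocode style as above, that check might look something like this (the exception name is only illustrative):
void push(string queue, string[] value){
    int index = couchbase.increment(queue + ':index');
    try {
        // add fails if the key already exists, unlike set which overwrites
        couchbase.add(queue + ':item:' + index, value);
    } catch (KeyExistsException e) {
        push(queue, value);   // retry with a freshly incremented index
    }
}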
UPD: I'm sorry, it was too early in the morning, so I couldn't think clearly.
For FIFO, as @avsej said, you'll need two counters: queue:head and queue:tail. So for FIFO:
void push(string queue, string[] value){
    int index = couchbase.increment(queue + ':tail');
    couchbase.set(queue + ':item:' + index, value);
}

string[] pop(string queue){
    int index = couchbase.increment(queue + ':head') - 1;
    string[] result = couchbase.get(queue + ':item:' + index);
    return result;
}
Note: the code may look slightly different depending on the starting values of queue:tail and queue:head (zero, one, or something else).
You can also set a maximum value for the counters, after which queue:tail and queue:head are reset to 0 (just to limit the number of documents), and you can set an expiry value on each document if you actually need it.
Couchbase already has a CouchbaseQueue data structure.
Example usage, taken from the SDK documentation linked below:
Queue<String> shoppingList = new CouchbaseQueue<String>("queueDocId", collection, String.class, QueueOptions.queueOptions());
shoppingList.add("loaf of bread");
shoppingList.add("container of milk");
shoppingList.add("stick of butter");
// What does the JSON document look like?
System.out.println(collection.get("queueDocId").contentAsArray());
//=> ["stick of butter","container of milk","loaf of bread"]
String item;
while ((item = shoppingList.poll()) != null) {
    System.out.println(item);
    // => loaf of bread
    // => container of milk
    // => stick of butter
}
// What does the JSON document look like after draining the queue?
System.out.println(collection.get("queueDocId").contentAsArray());
//=> []
Java SDK 3.1 CouchbaseQueue Doc

Perform long-polling from nodejs (possible memory leak)

I wrote a piece of code that performs a request to Facebook.
Now I've wrapped this code in an infinite loop which sends those requests every 10 seconds using timeouts.
Code:
var poll = function(socket, userProvider) {
    var lastCallTime = new Date();
    var polling = true;

    // The stream itself, non blocking
    function performPoll() {
        var results = feed(function (err, data) {
            lastCallTime = new Date();
            // PROCESS DATA
            // Check new posts
            if (polling) {
                setTimeout(performPoll, 1000 * 10);
            }
        });
    }

    // Start infinite loop
    performPoll();
};
feed(cb) just makes a request to Facebook for data. This works 100% and does what I want it to do; the only problem I am having now is that this piece of code keeps increasing my memory usage. After a few minutes it had already increased by 50MB (from 50 to 100).
Can anybody help me identify the cause of this?
v8 does not collect memory immediately. If it stabilizes at 100MB, then it is to be expected. For more information, check out nodejs setTimeout memory leak?
If you really, really want to clear the memory, use global.gc(). Read this blog about how to call the garbage collector manually.
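As a quick sketch: global.gc() only exists when Node is started with the --expose-gc flag, so guard the call:
// start the process with: node --expose-gc app.js
if (global.gc) {
    global.gc();   // force a collection; normally it is better to let v8 decide
} else {
    console.log('Run node with --expose-gc to enable manual garbage collection');
}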

Mongo c# driver freezes and never returns a value on Update()

I have a long-running operation that inserts thousands of sets of entries; each set is inserted using the code below.
After this code has been running for a while, the collection.Update() method freezes (does not return) and the entire process grinds to a halt.
I can't find any reasonable explanation for this anywhere.
I've looked at the mongod logs; nothing unusual, it just stops receiving requests from this process.
Mongo version: 2.4.1, C# driver version: 1.8.0
using (_mongoServer.RequestStart(_database))
{
    var collection = GetCollection<BsonDocument>(collectionName);

    // Iterate over all records
    foreach (var recordToInsert in recordsDescriptorsToInsert)
    {
        var query = new QueryDocument();
        var update = new UpdateBuilder();

        foreach (var property in recordToInsert)
        {
            var field = property.Item1;
            var value = BsonValue.Create(property.Item2);
            if (keys.Contains(field))
                query.Add(field, value);
            update.Set(field, value);
        }

        collection.Update(query, update, UpdateFlags.Upsert); // ** NEVER RETURNS **
    }
}
This may be related to CSHARP-717.
It was fixed in driver 1.8.1.