MongoDB insertion in chunks takes varying time

I am new to MongoDB and started a POC on improving the insertion time of a huge log file loaded into MongoDB in chunks. My chunk size is constant (~2 MB), and what I observe is that out of, say, 20 chunks, 1 or 2 chunks in between (at random) suddenly take around 20-30% more time than the others.
I varied the chunk size and saw that this behavior mostly vanishes with smaller chunks. I also profiled the run and saw that a secondary thread checks the MongoDB server status by pinging it, and that the additional time is spent receiving the response from the server. My guess is that this is caused by the concurrent write lock.
Any expert advice, and any suggestions, are welcome.
Thanks in advance.
The code snippet I've been using, with the timing measurement:
DateTime dt3 = DateTime.Now;
// Unacknowledged write concern: the insert returns without waiting for the server.
MongoInsertOptions options = new MongoInsertOptions();
options.WriteConcern = WriteConcern.Unacknowledged;
options.CheckElementNames = true;
//var task = InsertBatchAsync<LogEvent>(collection, logEventsChunk.LogEvents);
collection.InsertBatch(logEventsChunk.LogEvents, options);
// Capture the elapsed time once so the log line and the running total agree.
double elapsedSeconds = DateTime.Now.Subtract(dt3).TotalSeconds;
Console.WriteLine("Chunk Number: " + chunkCount.ToString() + Environment.NewLine
    + "Write time for " + logEventsChunk.LogEvents.Count + " logs in MONGODB = " + elapsedSeconds + "s" + Environment.NewLine);
mongoDBInsertionTotalTime += elapsedSeconds;
The snippet above runs in a loop, once for every chunk of data I receive.
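For what it's worth, the same measurement is a little tighter with a Stopwatch instead of DateTime.Now, which has coarse resolution. This is only a sketch; GetLogEventChunks is a hypothetical stand-in for however the chunks are actually produced:
// Requires: using System.Diagnostics;
double totalInsertSeconds = 0.0;
int chunkNumber = 0;
foreach (var logEventsChunk in GetLogEventChunks())   // hypothetical chunk source
{
    var sw = Stopwatch.StartNew();
    collection.InsertBatch(logEventsChunk.LogEvents, options);
    sw.Stop();
    chunkNumber++;
    Console.WriteLine("Chunk {0}: {1} logs in {2:F3}s",
        chunkNumber, logEventsChunk.LogEvents.Count, sw.Elapsed.TotalSeconds);
    totalInsertSeconds += sw.Elapsed.TotalSeconds;
}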

Increasing the TCP client buffer sizes (both send and receive) to 1 GB helped. They are exposed as public properties in MongoDefaults.cs.
While profiling the MongoDB C# driver I observed that the bottleneck was copying the NetworkStream, so increasing the buffers fixed it.
Also, since my DB server is hosted locally, I got rid of the concurrent server-status callback in the C# driver.
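For reference, a minimal sketch of the buffer change, assuming the legacy 1.x C# driver where these defaults are exposed on MongoDefaults; the database and collection names below are placeholders:
// Requires: using MongoDB.Driver;
// Must run before any connections are opened. 1 GB matches what worked above,
// but a smaller value may well be enough.
MongoDefaults.TcpReceiveBufferSize = 1024 * 1024 * 1024;
MongoDefaults.TcpSendBufferSize = 1024 * 1024 * 1024;

var client = new MongoClient("mongodb://localhost");
var collection = client.GetServer()
                       .GetDatabase("logdb")                   // placeholder database name
                       .GetCollection<LogEvent>("logEvents");  // placeholder collection name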

Related

MongoDB cursor is slow when fetching second batch

I'm seeing odd behavior when iterating over a cursor: fetching the second batch is very slow (several seconds or more). The first batch is reasonably fast, as are all the batches after the second one. The other odd thing is that I can make this behavior go away by increasing the batch size (the default is 100; I increased it to 800).
MongoCursor<Document> it = collection
        .find()
        .batchSize(800)
        .projection(Projections.fields(Projections.include("year")))
        .iterator();
int count = 0;
while (it.hasNext()) {
    System.out.println("" + count + " : " + it.next());
    count++;
}
For the above example, I'm using a DB with about half a million records. If I don't set the batch size, it pauses after printing the 100th record, then continues normally.
Can someone explain why this is happening? Is it a bug in the server or the client? Does it indicate a problem with the way I set up my DB?
Server version: 4.0.10
Java client version: 3.10.1

Active batches piling up with Spark Streaming and Kafka

I have developed a Spark Streaming (1.6.2) job with Kafka in the receiver model and am running it with a batch interval of 15 seconds.
The very first batch gets a lot of events and processes records very slowly. Suddenly the job fails and restarts. Please see the screenshot below.
It is processing records, but too slowly to finish all these batches on time, and I don't want to see this queue piling up.
How can we limit the input size to around 15-20k events per batch? I tried enabling spark.streaming.backpressure.enabled but don't see any improvement.
I also implemented parallelism in data receiving, as below, but still didn't see any change in the input size.
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
I am using 6 executors and 20 cores.
Overview of my code:
I read logs from Kafka, process them, and store them in Elasticsearch in every 15-second batch interval.
Could you please let me know how I can control the input size and improve the performance of the job, or how to make sure the batches do not pile up?

How to Deal with Inaccurate MSMQ Performance Counters

I have a private queue. I am using the following code to efficiently return the number of messages in the queue.
var queuePath = _queue.Path; // _queue is a System.Messaging.MessageQueue object
if (queuePath.StartsWith("."))
{
    queuePath = System.Environment.GetEnvironmentVariable("COMPUTERNAME") + queuePath.Substring(1);
}
var queueCounter = new PerformanceCounter("MSMQ Queue", "Messages in Queue", queuePath.ToLower(), true);
This code usually works, but it is sometimes wrong (i.e. the value lags behind the actual length). How can I force an MSMQ queue counter to be up to date?
My test inserts two items and then checks the length. If I step through it, the value is correct, but if I run it at full speed the count comes back as zero.
Is there a better way than a Performance Counter for this purpose?
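One commonly suggested alternative, sketched below, is to skip the performance counter and count by peeking the queue with an enumerator. It reads the queue state directly, so it doesn't suffer from counter lag, at the cost of being O(n) in the number of messages:
// Requires: using System.Messaging;
private static int CountMessages(MessageQueue queue)
{
    int count = 0;
    // GetMessageEnumerator2 peeks messages without removing them from the queue.
    using (MessageEnumerator enumerator = queue.GetMessageEnumerator2())
    {
        while (enumerator.MoveNext())
        {
            count++;
        }
    }
    return count;
}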

OrientDB 2.0.0 Bulk Load Using Java API is CPU-Bound

I'm using OrientDB 2.0.0 to test its handling of bulk data loading. For sample data, I'm using the GDELT dataset from Google's GDELT Project (free download). I'm loading a total of ~80M vertices, each with 8 properties, into the V class of a blank graph database using the Java API.
The data is in a single tab-delimited text file (US-ASCII), so I'm simply reading the text file from top to bottom. I configured the database using OIntentMassiveInsert(), and set the transaction size to 25,000 records per commit.
I'm using an 8-core machine with 32G RAM and an SSD, so the hardware should not be a factor. I'm running Windows 7 Pro with Java 8r31.
The first 20M (or so) records went in quite quickly, at under 2 seconds per batch of 25,000. I was very encouraged.
However, as the process has continued to run, the insert rate has slowed significantly. The slowing appears to be pretty linear. Here are some sample lines from my output log:
Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000
As the operation has progressed, memory usage has stayed pretty constant, but OrientDB's CPU usage has climbed higher and higher, in step with the batch durations. At the beginning, the OrientDB Java process was using about 5% CPU. It is now up to about 90%, with the utilization evenly distributed across all 8 cores.
Should I break the load operation into several sequential connections, or is it really a function of how the vertex data is managed internally, such that it would not matter if I stopped the process and continued inserting where I left off?
Thanks.
[Update] The process eventually died with the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
All commits were successfully processed, and I ended up with a little over 51M records. I'll look at restructuring the loader to break the one giant file into many smaller files (say, 1M records each) and treat each file as a separate load.
Once that completes, I will attempt to take the flat vertex list and add some edges. Any suggestions on how to do that in the context of a bulk insert, where vertex IDs have not yet been assigned? Thanks.
[Update 2] I'm using the Graph API. Here is the code:
// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraph txGraph = factory.getTx();

// Iterate row by row over the file.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    try {
        Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGINS A TRANSACTION
        for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
            v.setProperty(headerFieldsReduced[i], fields[i]);
        }
        // Commit every so often to balance performance and transaction size
        if (++counter % commitPoint == 0) {
            txGraph.commit();
        }
    } catch (Exception e) {
        txGraph.rollback();
    }
}
[Update 3 - 2015-02-08] Problem solved!
If I had read the documentation more carefully I would have seen that using transactions in a bulk load is the wrong strategy. I switched to using the "NoTx" graph and to adding the vertex properties in bulk, and it worked like a champ without slowing down over time or pegging the CPU.
I started with 52m vertexes in the database, and added 19m more in 22 minutes at a rate of just over 14,000 vertexes per second, with each vertex having 16 properties.
Map<String, Object> props = new HashMap<String, Object>();

// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
graph = factory.getNoTx();

// For each row of the input file (same read loop as before):
OrientVertex v = graph.addVertex(null);
for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
    props.put(headerFieldsReduced[i], fields[i]);
}
// Set all properties in one call instead of one round trip per property.
v.setProperties(props);

Pump Messages During Long Operations + C#

Hi, I have a web service that does a huge computation and takes more than a minute.
I generated the proxy for the web service, and on the client side I am using that DLL (of course, I generated the proxy DLL).
My client-side code is:
TimeSeries3D t = new TimeSeries3D();
int portfolioId = 4387919;
string[] str = new string[2];
str[0] = "MKT_CAP";
DateRange dr = new DateRange();
dr.mStartDate = DateTime.Today;
dr.mEndDate = DateTime.Today;
Service1 sc = new Service1();
t = sc.GetAttributesForPortfolio(portfolioId, true, str, dr);
But since it is taking too much time for the server to compute, after 1 minute I receive this error message:
The CLR has been unable to transition from COM context 0x33caf30 to COM context 0x33cb0a0 for 60 seconds. The thread that owns the destination context/apartment is most likely either doing a non pumping wait or processing a very long running operation without pumping Windows messages. This situation generally has a negative performance impact and may even lead to the application becoming non responsive or memory usage accumulating continually over time. To avoid this problem, all single threaded apartment (STA) threads should use pumping wait primitives (such as CoWaitForMultipleHandles) and routinely pump messages during long running operations.
Kindly guide me on what to do.
Thanks
Are you calling this from within a UI thread? If so, that's the problem. Put long-running operations on background threads, then marshal calls back to the UI thread to update it (e.g. with BackgroundWorker or Control.Invoke).
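A minimal sketch of that suggestion, assuming a WinForms-style client where the call currently happens on the UI thread:
// Requires: using System.ComponentModel;
// Run the slow proxy call on a background thread so the UI thread keeps pumping
// messages, then pick up the result in RunWorkerCompleted (back on the UI thread).
var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    Service1 sc = new Service1();
    e.Result = sc.GetAttributesForPortfolio(portfolioId, true, str, dr);
};
worker.RunWorkerCompleted += (s, e) =>
{
    TimeSeries3D t = (TimeSeries3D)e.Result;
    // update the UI with t here
};
worker.RunWorkerAsync();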
You need to refactor your service interface to something like the Begin/End pattern and do the long work on the thread pool.
The client calls once to start the operation, and the server runs it on its thread pool.
Then, later, the client calls again to see whether the process has completed (and to get the results if it has).
If the long-running task can report progress, so the client gets more than "done"/"not done", so much the better.
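And a rough sketch of that pattern from the client's point of view; StartCalculation, IsComplete, and GetResult are hypothetical operations you would add to the service, not part of the existing Service1 contract:
// Requires: using System.Threading;
Service1 sc = new Service1();

// Kick off the long computation; the server queues it on its thread pool
// and returns a ticket immediately instead of blocking for a minute.
string ticket = sc.StartCalculation(portfolioId, true, str, dr);

// Poll until the server says the work is done, then fetch the result.
while (!sc.IsComplete(ticket))
{
    Thread.Sleep(5000); // or use a timer so the client stays responsive
}
TimeSeries3D t = sc.GetResult(ticket);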