I'm seeing an odd behavior when iterating over a cursor: fetching the second batch is very slow (several seconds or more). The first batch is reasonably fast, as are all the batches after the second one. And the other odd thing is, I can make this behavior go away by increasing the batch size (default is 100, I increased it to 800).
MongoCursor<Document> it = collection
        .find()
        .batchSize(800)
        .projection(Projections.fields(Projections.include("year")))
        .iterator();

int count = 0;
while (it.hasNext()) {
    System.out.println(count + " : " + it.next());
    count++;
}
For the above example, I'm using a DB with about half a million records. If I don't set the batch size, it pauses after printing the 100th record, then continues normally.
Can someone explain why this is happening? Is it a bug in the server or the client? Does it indicate a problem with the way I set up my DB?
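To pin down exactly which fetch stalls (and to rule out my own printing overhead), I also timed each next() call with something like the sketch below. This is not my exact code; the connection string, database, and collection names are placeholders.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Projections;
import org.bson.Document;

public class CursorTiming {
    public static void main(String[] args) {
        // Placeholder connection string, database, and collection names
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> collection = client.getDatabase("test").getCollection("events");

        MongoCursor<Document> it = collection
                .find()
                .projection(Projections.fields(Projections.include("year")))
                .iterator();

        long last = System.nanoTime();
        int count = 0;
        while (it.hasNext()) {
            it.next();
            long now = System.nanoTime();
            long elapsedMs = (now - last) / 1_000_000;
            // With the default batch size of 100, the only slow call is the one that
            // pulls the second batch from the server, i.e. around document #100.
            if (elapsedMs > 100) {
                System.out.println("document " + count + " took " + elapsedMs + " ms to fetch");
            }
            last = now;
            count++;
        }

        it.close();
        client.close();
    }
}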
Server version: 4.0.10
Java client version: 3.10.1
I'm trying to simulate pallet behavior by using batch and move to. This works fine except towards the end where the number of elements left is smaller than the batch size, and these never get picked up. Any way out of this situation?
I have tried messing with custom queues and pickup/dropoff pairs.
To elaborate, the batch object has a queue size of 15. However, once the entire set has been processed, fewer than 15 elements remain, and these never get picked up by the subsequent moveTo block. I need to send the agents on to the subsequent block once the queue size falls below 15.
You can dynamically change the batch size of your Batch object towards "the end" (whatever you mean by that :-) ). You need to figure out when to change the batch size, as this depends on your model. But once it is time to adjust, you can call myBatchItem.set_batchSize(1) and it will from then on "batch" agents individually, i.e. in batches of one.
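As a rough sketch (supplyEnded below is a placeholder boolean you would set yourself, and batch stands for your Batch block; neither name comes from your model), the adjustment could sit in the action of whatever event detects that no more agents will arrive:

// Hypothetical event action: supplyEnded is set once the upstream source stops
// producing; batch is the Batch block holding the leftover agents.
if (supplyEnded && batch.size() > 0) {
    batch.set_batchSize(1); // from now on, agents pass through individually
}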
However, a better model design might be to have a cool-down period before the model end, i.e. stop taking model measurements before your batch objects run out of agents to batch.
You need to know somehow which element is the last one, for example by using a boolean variable called isLast in your agent that is true only for the last agent.
Then, in the batch, you have to change the batch size programmatically, maybe like this in the "On enter" action of your Batch block:
if (agent.isLast)
    self.set_batchSize(self.size());
To detect the "end", or any other lack of supply, I suggest a timeout. I would save a timestamp in a variable lastBatchDate in the "On exit" code of the Batch block:
lastBatchDate = date();
A cyclically triggered event checkForLeftovers then checks periodically whether objects are waiting to be batched and the timeout (here: 10 minutes) has been reached. In that case, the batch size is reduced to exactly the number of waiting objects, so that they continue as a smaller batch:
if (lastBatchDate != null // prevent a NullPointerException while lastBatchDate is not yet set
        && ((date().getTime() - lastBatchDate.getTime()) / 1000) > 600 // more than 600 seconds since the last batch
        && batch.size() > 0 // something is waiting
        && batch.size() < BATCH_SIZE // not more than a normal batch is waiting
) {
    batch.set_batchSize(batch.size()); // set the batch size to exactly the amount waiting
} else {
    batch.set_batchSize(BATCH_SIZE); // reset the batch size to the default value BATCH_SIZE
}
The model then consists of the Batch block (writing lastBatchDate in its "On exit" action) plus the cyclic checkForLeftovers event.
However, as Benjamin already noted, you should be careful whether this is really what you need to model. Consider, for example, these aspects:
Is the timeout long enough to not accidentally push smaller batches during normal operations?
Is the timeout short enough to have any effect?
Is it ok to have a batch of a smaller size downstream in your process?
etc.
You might instead want to make sure upstream that the objects reaching the batching station always fill complete batches, or you might simply stop your simulation before the line "runs dry".
You can see the model and download the source code here.
I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed across 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries each, which grow throughout the application. The entire state is up to 3GB, which each node in the cluster can handle. In each batch, some data is added to the state but nothing is deleted until the very end of the process, i.e. after ~15 minutes.
While following the application UI, I noticed that every 10th batch's processing time is very high compared to the other batches. In the attached screenshots, the yellow fields mark the batches with high processing time.
A more detailed Job view shows that in these batches the delay occurs at a certain point, exactly when all 20 partitions are "skipped". Or at least that is what the UI says.
My understanding of "skipped" is that each state partition is one possible task which isn't executed because it doesn't need to be recomputed. However, I don't understand why the number of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size; the size only affects the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended behaviour?
This is intended behaviour. The spikes you're seeing are because your data is getting checkpointed at the end of those batches. If you look at the times of the longer batches, you'll see that it happens consistently every 100 seconds. That's because the checkpoint interval is constant: it is derived from your batchDuration (how often you talk to your data source to read a batch) multiplied by a constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
This lines up exactly with the behaviour you're seeing, since your batch duration is 10 seconds => 10 * 10 = 100 seconds.
This is normal, and it is the cost of persisting state with Spark. An optimization on your side could be to think about how to minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure the data is spread over enough executors, so that state is distributed uniformly across all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
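If the 100-second rhythm doesn't suit you, you can also set the checkpoint interval on the state stream explicitly. Below is a minimal, self-contained sketch of both that and the Kryo setting; it is written against the Spark 1.6 Java streaming API with a placeholder socket source, key names, and checkpoint path (your Scala code would make the analogous calls, e.g. stateDstream.checkpoint(Seconds(50))):

import com.google.common.base.Optional; // Spark 1.6's Java API uses Guava's Optional
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("checkpoint-interval-sketch")
                // Kryo is usually much faster than default Java serialization for state snapshots
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint("/tmp/checkpoints"); // mapWithState requires a checkpoint directory

        // Placeholder source: one (key, 1) pair per line received on a local socket
        JavaPairDStream<String, Integer> pairs = ssc
                .socketTextStream("localhost", 9999)
                .mapToPair(line -> new Tuple2<>(line, 1));

        // Running count per key, kept in mapWithState
        Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
                (key, one, state) -> {
                    int sum = one.or(0) + (state.exists() ? state.get() : 0);
                    state.update(sum);
                    return new Tuple2<>(key, sum);
                };

        JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateStream =
                pairs.mapWithState(StateSpec.function(mappingFunc).numPartitions(20));

        // Override the default checkpoint interval (10 x batch duration = 100 s here),
        // e.g. checkpoint every 5 batches to trade one big spike for smaller, more frequent ones.
        stateStream.checkpoint(Durations.seconds(50));

        stateStream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}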
In addition to the accepted answer, which points out the price of serialization related to checkpointing, there's another, less-known issue that might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price.
If you process the stream of timed-out/deleted events, e.g. persist it to external storage, the associated cost for all 10 batches is paid only at this point (and not, as one might have expected, on each RDD).
I'm using OrientDB 2.0.0 to test its handling of bulk data loading. For sample data, I'm using the GDELT dataset from Google's GDELT Project (free download). I'm loading a total of ~80M vertices, each with 8 properties, into the V class of a blank graph database using the Java API.
The data is in a single tab-delimited text file (US-ASCII), so I'm simply reading the text file from top to bottom. I configured the database using OIntentMassiveInsert(), and set the transaction size to 25,000 records per commit.
I'm using an 8-core machine with 32G RAM and an SSD, so the hardware should not be a factor. I'm running Windows 7 Pro with Java 8r31.
The first 20M (or so) records went in quite quickly, at under 2 seconds per batch of 25,000. I was very encouraged.
However, as the process has continued to run, the insert rate has slowed significantly. The slowing appears to be pretty linear. Here are some sample lines from my output log:
Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000
As the operation has progressed, the memory usage has stayed pretty constant, but the CPU usage by OrientDB has climbed higher and higher, consistent with the growing batch durations. In the beginning, the OrientDB Java process was using about 5% CPU; it is now up to about 90%, with the utilization nicely distributed across all 8 cores.
Should I break the load operation down into several sequential connections, or is the slowdown really a function of how the vertex data is being managed internally, so that it would not matter if I stopped the process and continued inserting where I left off?
Thanks.
[Update] The process eventually died with the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
All commits were successfully processed, and I ended up with a little over 51M records. I'll look at restructuring the loader to break the one giant file into many smaller files (say, 1M records each) and treat each file as a separate load.
Once that completes, I will attempt to take the flat vertex list and add some edges. Any suggestions on how to do that in the context of a bulk insert, where vertex IDs have not yet been assigned? Thanks.
[Update 2] I'm using the Graph API. Here is the code:
// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraph txGraph = factory.getTx();

// Iterate row by row over the file.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    try {
        Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGINS A TRANSACTION
        for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
            v.setProperty(headerFieldsReduced[i], fields[i]);
        }
        // Commit every so often to balance performance and transaction size
        if (++counter % commitPoint == 0) {
            txGraph.commit();
        }
    } catch (Exception e) {
        txGraph.rollback();
    }
}
[Update 3 - 2015-02-08] Problem solved!
If I had read the documentation more carefully I would have seen that using transactions in a bulk load is the wrong strategy. I switched to using the "NoTx" graph and to adding the vertex properties in bulk, and it worked like a champ without slowing down over time or pegging the CPU.
I started with 52m vertexes in the database, and added 19m more in 22 minutes at a rate of just over 14,000 vertexes per second, with each vertex having 16 properties.
Map<String, Object> props = new HashMap<String, Object>();

// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraphNoTx graph = factory.getNoTx();

// Inside the same per-line loop as before: create the vertex, then set all of
// its properties in a single call instead of one setProperty() per field.
OrientVertex v = graph.addVertex(null);
for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
    props.put(headerFieldsReduced[i], fields[i]);
}
v.setProperties(props);
I am new to MongoDB and started a POC on improving the insertion time of a huge log file into MongoDB in chunks. My chunk size is constant (~2MB), and what I observe is that out of, say, 20 chunks, all of a sudden 1 or 2 chunks in between (at random) take around 20-30% more time than the others.
I varied the chunk size and saw that this behavior mostly vanishes with smaller chunks. I also did some profiling and saw that a secondary thread checks the MongoDB server status by pinging it, and that the additional time is consumed while receiving the response from the server. My guess is that it is caused by the concurrent write lock.
Any expert advice or suggestions are welcome.
Thanks in advance.
The code snippet I've been using to measure the time:
DateTime dt3 = DateTime.Now;

MongoInsertOptions options = new MongoInsertOptions();
options.WriteConcern = WriteConcern.Unacknowledged;
options.CheckElementNames = true;

//var task = InsertBatchAsync<LogEvent>(collection, logEventsChunk.LogEvents);
collection.InsertBatch(logEventsChunk.LogEvents, options);

Console.WriteLine("Chunk Number: " + chunkCount.ToString() + Environment.NewLine
    + "Write time for " + logEventsChunk.LogEvents.Count + " logs in MONGODB = "
    + DateTime.Now.Subtract(dt3).TotalSeconds + "s" + Environment.NewLine);

mongoDBInsertionTotalTime += DateTime.Now.Subtract(dt3).TotalSeconds;
The above code snippet runs in a loop, once for every chunk of data I get.
Increasing the TCP client buffer sizes (both send and receive) to 1GB helped. They are exposed as public properties in MongoDefaults.cs.
On profiling the Mongo C# driver, I observed that the bottleneck was copying the network stream, so I increased the buffers and it worked.
Also, since my DB server is hosted locally, I got rid of the concurrent server-status callback in the C# driver.
By default, Mongo cursors die after 10 minutes of inactivity. I have a blank cursor that I eventually want to run through the whole database, but there will be periods of inactivity of over 10 minutes. I need a way to keep it alive so I can keep calling it.
Turning the expiry timeout off completely is not an option: if this program crashes, it would leave cursors lingering in the database's memory, which is not good. Also, occasionally calling .next() during my other work does not help, because the batch sizes are set fairly high to get good performance in the parts of the code that call the cursor a lot.
I tried periodically checking cursor.alive to see if that sent a signal to Mongo that would keep the cursor active, but it did not work.
Try using a smaller batch size. Each batch fetch counts as activity on the cursor, so you should not hit the 10-minute timeout:
for doc in coll.find().batch_size(10):
    ...  # process doc here
Alternatively, you can set timeout=False when calling find (but this can lead to issues if the cursor is never manually closed):
for doc in coll.find(timeout=False):
    ...  # process doc here