OrientDB 2.0.0 Bulk Load Using Java API is CPU-Bound

I'm using OrientDB 2.0.0 to test its handling of bulk data loading. For sample data, I'm using the GDELT dataset from Google's GDELT Project (free download). I'm loading a total of ~80M vertices, each with 8 properties, into the V class of a blank graph database using the Java API.
The data is in a single tab-delimited text file (US-ASCII), so I'm simply reading the text file from top to bottom. I configured the database using OIntentMassiveInsert(), and set the transaction size to 25,000 records per commit.
I'm using an 8-core machine with 32 GB of RAM and an SSD, so the hardware should not be a factor. I'm running Windows 7 Pro with Java 8u31.
The first 20M (or so) records went in quite quickly, at under 2 seconds per batch of 25,000. I was very encouraged.
However, as the process has continued to run, the insert rate has slowed significantly. The slowing appears to be pretty linear. Here are some sample lines from my output log:
Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000
As the operation has progressed, memory usage has stayed fairly constant, but the CPU usage of the OrientDB process has climbed steadily, tracking the increase in batch duration. At the start, the OrientDB Java process was using about 5% CPU; it is now at about 90%, with the load spread evenly across all 8 cores.
Should I break the load into several sequential connections, or is this a function of how the vertex data is managed internally, meaning it would make no difference if I stopped the process and resumed inserting where I left off?
Thanks.
[Update] The process eventually died with the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
All commits were successfully processed, and I ended up with a little over 51M records. I'll look at restructuring the loader to break the one giant file into many smaller files (say, 1M records each) and treat each file as a separate load.
Once that completes, I will attempt to take the flat vertex list and add some edges. Any suggestions on how to do that in the context of a bulk insert, where vertex IDs have not yet been assigned? Thanks.
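One option I'm considering (just a sketch, untested; the property name, key values, and edge label below are illustrative) is to store a unique key property on each vertex during the load, index it, and resolve both endpoints of each edge by that key rather than by record ID:

import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

// Connect the same way as for the vertex load (non-transactional)
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraphNoTx graph = factory.getNoTx();

// Index the key property once so the lookups below are not full scans.
// "GlobalEventID" is an assumed field name taken from the GDELT header.
graph.createKeyIndex("GlobalEventID", Vertex.class);

// For each edge in the source data: look up both endpoints by key, then link them.
// (A real loader would check that each lookup actually returns a vertex.)
Vertex from = graph.getVertices("GlobalEventID", "498765432").iterator().next();
Vertex to   = graph.getVertices("GlobalEventID", "498765433").iterator().next();
graph.addEdge(null, from, to, "relatedTo"); // "relatedTo" is a placeholder label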
[Update 2] I'm using the Graph API. Here is the code:
// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraph txGraph = factory.getTx();

// Iterate row by row over the file.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    try {
        Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGIN A TRANSACTION
        for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
            v.setProperty(headerFieldsReduced[i], fields[i]);
        }
        // Commit every so often to balance performance and transaction size
        if (++counter % commitPoint == 0) {
            txGraph.commit();
        }
    } catch (Exception e) {
        txGraph.rollback();
    }
}
[Update 3 - 2015-02-08] Problem solved!
If I had read the documentation more carefully, I would have seen that using transactions in a bulk load is the wrong strategy. I switched to the "NoTx" graph and to adding the vertex properties in bulk, and it worked like a champ without slowing down over time or pegging the CPU.
I started with 52M vertices in the database and added 19M more in 22 minutes, at a rate of just over 14,000 vertices per second, with each vertex having 16 properties.
Map<String, Object> props = new HashMap<String, Object>();

// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraphNoTx graph = factory.getNoTx();

// Inside the same row-by-row loop as before: create the vertex,
// build the property map, then set all properties in a single call.
OrientVertex v = graph.addVertex(null);
for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
    props.put(headerFieldsReduced[i], fields[i]);
}
v.setProperties(props);
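When a load like this finishes, it is probably also worth releasing the intent and closing the connection pool; a small cleanup sketch, assuming the same graph and factory variables as above:

// Release the massive-insert intent on the underlying database,
// then close the graph and the factory's connection pool.
graph.getRawGraph().declareIntent(null);
graph.shutdown();
factory.close();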

Related

How to speed up downloading a CSV file locally with PySpark (databricks)?

We created an ImageClassifier to predict whether certain Instagram images are of a certain class. Running this model works fine.
# creating deep image featurizer using the InceptionV3 model
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
# using lr for speed and reliability
lr = LogisticRegression(maxIter=5, regParam=0.03,
                        elasticNetParam=0.5, labelCol="label")
# define Pipeline
sparkdn = Pipeline(stages=[featurizer, lr])
spark_model = sparkdn.fit(df)
We built this separately from our basetable (which runs on a larger cluster). We need to extract the spark_model predictions as a CSV to import it back into the other notebook and merge it with our basetable.
To do this we have tried the following:
image_final_estimation = spark_model.transform(image_final)
display(image_final_estimation)  # since this gives an option in Databricks to download the CSV
AND
image_final_estimation.coalesce(1).write.csv(path = 'imagesPred2.csv') #and then we would be able to read it back in with spark.read.csv
The thing is, these operations take very long (probably due to the nature of the task) and they crash our cluster. We are able to show our outcome, but only with .show(), not with the display() method.
Is there any other way to save this csv locally? Or how can we improve the speed of these tasks?
Please note that we use the community edition of Databricks.
When writing DataFrames to files, a good way to parallelize the writing is to define an appropriate number of partitions for that DataFrame/RDD.
In the code you show, you are using the coalesce function, which reduces the number of partitions to 1 and thus limits the parallelism.
On Databricks Community Edition, I tried the following test with a CSV dataset provided by Databricks (https://docs.databricks.com/getting-started/databricks-datasets.html). The idea is to measure the time elapsed when writing the data to CSV using one partition vs. using many partitions.
carDF = spark.read.option("header", True).csv("dbfs:/databricks-datasets/Rdatasets/data-001/csv/car/*")
print("Total count of Rows {0}".format(carDF.count()))
print("Original Partitions Number: {0}".format(carDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>Original Partitions Number: 7
%timeit carDF.write.format("csv").mode("overwrite").save("/tmp/caroriginal")
>>2.79 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, up to now, writing the dataset to a file using 7 partitions took 2.79 seconds.
newCarDF = carDF.coalesce(1)
print("Total count of Rows {0}".format(newCarDF.count()))
print("New Partitions Number: {0}".format(newCarDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>New Partitions Number: 1
%timeit newCarDF.write.format("csv").mode("overwrite").save("/tmp/carmodified")
>>4.13 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, for the same DataFrame, writing to a CSV with one partition took 4.13 seconds.
In conclusion, in this case the coalesce(1) call is hurting the write performance.
Hope this helps

MongoDB cursor is slow when fetching second batch

I'm seeing an odd behavior when iterating over a cursor: fetching the second batch is very slow (several seconds or more). The first batch is reasonably fast, as are all the batches after the second one. And the other odd thing is, I can make this behavior go away by increasing the batch size (default is 100, I increased it to 800).
MongoCursor<Document> it = collection
        .find()
        .batchSize(800)
        .projection(Projections.fields(Projections.include("year")))
        .iterator();

int count = 0;
while (it.hasNext()) {
    System.out.println("" + count + " : " + it.next());
    count++;
}
For the above example, I'm using a DB with about half a million records. If I don't set the batch size, it pauses after printing the 100th record, then continues normally.
Can someone explain why this is happening? Is it a bug in the server or the client? Does it indicate a problem with the way I set up my DB?
Server version: 4.0.10
Java client version: 3.10.1

Active Batches piling up with spark streaming with Kafka

I have developed a Spark Streaming (1.6.2) job with Kafka in receiver mode, and I am running it with a batch interval of 15 seconds.
The very first batch gets a lot of events and processes records very slowly. Suddenly the job fails and restarts again. Please see the screenshot below.
It is processing records, but too slowly to finish all these batches on time, and I don't want to see this queue of active batches piling up.
How can we control the input size to around 15 to 20k events per batch? I tried enabling spark.streaming.backpressure.enabled but don't see any improvement.
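For reference, with receiver-based streams the intake can also be capped directly per receiver; a minimal sketch of the relevant configuration (shown with the Java API here; the 250 records/sec figure is an assumption chosen so that 250 records/sec x 15 s x 5 receivers is about 18,750 events per batch, inside the 15-20k target):

import org.apache.spark.SparkConf;

// Each receiver pulls at most spark.streaming.receiver.maxRate records per second,
// so a 15-second batch holds roughly maxRate * 15 * numReceivers records.
SparkConf conf = new SparkConf()
        .setAppName("kafka-receiver-rate-limit")   // illustrative app name
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.receiver.maxRate", "250");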
I also implemented "Level of Parallelism in Data Receiving" as below, but still didn't see any change in the input size.
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
I am using 6 executors and 20 cores.
Overview of my code:
I am reading logs from Kafka, processing them, and storing them in Elasticsearch every 15-second batch.
Could you please let me know how I can control the input size and improve the performance of the job, or how to make sure the batches are not piling up?

YARN killing containers for exceeding memory limits

I am running into an issue where YARN is killing my containers for exceeding memory limits:
Container killed by YARN for exceeding memory limits. physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have 20 nodes of type m3.2xlarge, so they have:
cores: 8
memory: 30 GB
storage: 200 GB EBS
The gist of my application is that I have a couple hundred thousand assets, for each of which I have historical data generated for each hour of the last year, with a total dataset size of 2 TB uncompressed. I need to use this historical data to generate a forecast for each asset. My setup is that I first use s3distcp to move the data, stored as indexed LZO files, to HDFS. I then pull the data in and pass it to Spark SQL to handle the JSON:
val files = sc.newAPIHadoopFile("hdfs:///local/*",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text], conf)
val lzoRDD = files.map(_._2.toString)
val data = sqlContext.read.json(lzoRDD)
I then use a groupBy to group the historical data by asset, creating a tuple of (assetId, timestamp, sparkSqlRow). I figured this data structure would allow for better in-memory operations when generating the forecasts per asset.
val p = data.map(asset => (asset.getAs[String]("assetId"),asset.getAs[Long]("timestamp"),asset)).groupBy(_._1)
I then use a foreach to iterate over each row, calculate the forecast, and finally write the forecast back out as a JSON file to S3.
p.foreach { asset =>
  (1 to dateTimeRange.toStandardHours.getHours).foreach { hour =>
    // determine the hour from the previous year
    val hourFromPreviousYear = (currentHour + hour.hour) - timeRange
    // convert to milliseconds
    val timeToCompare = hourFromPreviousYear.getMillis
    val al = asset._2.toList
    println(s"Working on asset ${asset._1} for hour $hour with time-to-compare: $timeToCompare")
    // calculate the year-over-year average for the asset
    val yoy = calculateYOYforAsset2(al, currentHour, asset._1)
    // get the historical data for the asset from the previous year
    val pa = asset._2.filter(_._2 == timeToCompare)
      .map(row => calculateForecast(yoy, row._3, asset._1, (currentHour + hour.hour).getMillis))
      .foreach(json => writeToS3(json, asset._1, (currentHour + hour.hour).getMillis))
  }
}
Is there a better way to accomplish this so that I don't hit the memory issue with YARN?
Is there a way to chunk the assets so that the foreach only operates on about 10k at a time vs all 200k of the assets?
Any advice/help appreciated!
It's not your code. And don't worry, foreach does not run all those lambdas concurrently. The problem is that Spark's default value for spark.yarn.executor.memoryOverhead (recently renamed in 2.3+ to spark.executor.memoryOverhead) is overly conservative, which causes your executors to be killed when under load.
The solution is (as suggested by the error message) to increase that value. I would start by setting it to 1 GB (i.e. 1024, since the value is in megabytes), or more if you are requesting a lot of memory for each executor. The goal is to get jobs running without any executors being killed.
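For reference, a minimal sketch of supplying that setting programmatically (shown with the Java API; it can equally be passed as --conf spark.yarn.executor.memoryOverhead=1024 to spark-submit, and it must be set before the SparkContext is created):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// The overhead is specified in megabytes: 1024 reserves an extra 1 GB of
// off-heap headroom per executor before YARN's limit is hit.
SparkConf conf = new SparkConf()
        .setAppName("asset-forecast")   // illustrative app name
        .set("spark.yarn.executor.memoryOverhead", "1024");
JavaSparkContext sc = new JavaSparkContext(conf);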
Alternatively, if you control the cluster, you could disable YARN memory enforcement by setting yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml.

MongoDB Insertion in chunk takes varying time

I am new to MongoDB and started doing a POC on improving the insertion time of a huge log file into MongoDB in chunks. My chunk size is constant (~2 MB), and what I observe is that out of, say, 20 chunks, 1 or 2 chunks in between (at random) suddenly take around 20-30% more time than the others.
I varied the chunk size and saw that this behavior mostly vanishes with smaller chunks. I also profiled the code and saw that a secondary thread checks the MongoDB server status by pinging it, and the additional time is consumed while receiving the response back from the server. My guess is that it is because of the concurrent write lock.
Any expert advice on this, and any suggestions, are welcome.
Thanks in advance.
The code snippet I've been using to measure the time:
DateTime dt3 = DateTime.Now;

MongoInsertOptions options = new MongoInsertOptions();
options.WriteConcern = WriteConcern.Unacknowledged;
options.CheckElementNames = true;

//var task = InsertBatchAsync<LogEvent>(collection, logEventsChunk.LogEvents);
collection.InsertBatch(logEventsChunk.LogEvents, options);

Console.WriteLine("Chunk Number: " + chunkCount.ToString() + Environment.NewLine
    + "Write time for " + logEventsChunk.LogEvents.Count + " logs in MONGODB = "
    + DateTime.Now.Subtract(dt3).TotalSeconds + "s" + Environment.NewLine);

mongoDBInsertionTotalTime += DateTime.Now.Subtract(dt3).TotalSeconds;
The above code snippet runs in a loop for every chunk of data I get.
Increasing the TCP client buffer sizes (both send and receive) to 1 GB helped. They are exposed as public properties in MongoDefaults.cs.
When profiling the MongoDB C# driver, I observed that the bottleneck was copying the network stream, so I increased the buffers and it worked.
Also, since my DB server is hosted locally, I got rid of the concurrent server-status callback in the C# driver.