Spark creating index in elasticsearch results in OutOfMemoryError - scala

In a Spark job we are creating an index in Elasticsearch from a file of ~120 GB. The file is divided into 3830 partitions. After the job starts putting data into the Elasticsearch index, garbage collector messages appear in the Elasticsearch logs, ending with java.lang.OutOfMemoryError: Java heap space.
[WARN ][monitor.jvm] [gc][old][209][6] duration [57.6s], collections [1]/[57.9s], total [57.6s]/[3.7m], memory [24.2gb]->[18.7gb]/[24.9gb], all_pools {[young] [8.6mb]->[10.3mb]/[532.5mb]}{[survivor] [66.5mb]->[0b]/[66.5mb]}{[old] [24.2gb]->[18.7gb]/[24.3gb]}
java.lang.OutOfMemoryError: Java heap space
I restarted the Elasticsearch nodes and ran the job again, but the result was the same.
The following are the Elasticsearch properties in the Spark config:
conf.set("es.nodes", "xx.xx.xxx.xx:9200")
conf.set("es.scroll.size", "10000")
conf.set("es.index.auto.create", "true")
We are using Spark 1.6 and Elasticsearch 2.0.2. There are three Elasticsearch nodes in the cluster, each with 25 GB of heap space.
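(Not from the question itself, just a sketch of what one might tune: elasticsearch-hadoop exposes bulk-sizing options that cap how much each Spark task sends per bulk request, which is a common lever when the cluster heap fills up during indexing. The values below are illustrative, not recommendations.)
// Sketch only: shrink the per-task bulk requests (connector defaults are 1000 docs / 1mb)
conf.set("es.batch.size.entries", "500")    // documents per bulk request
conf.set("es.batch.size.bytes", "1mb")      // bytes per bulk request
conf.set("es.batch.write.refresh", "false") // don't force an index refresh after each bulk write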

Related

Apache Beam Pipeline: OutOfMemoryException while writing Avro files to Google Cloud Storage with Google Dataflow

We have a batch pipeline developed with the Apache Beam Java SDK 2.34.0 and running on the Google Cloud Dataflow runner. We have a step that writes Avro files, and the Avro write is throwing an OutOfMemory exception. The batch is trying to write around 800 Avro files, each no more than 50 KB.
Error message from worker: An OutOfMemoryException occurred. Consider specifying higher memory instances in PipelineOptions.
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:60)
org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.finishBundle(WriteFiles.java:974)
Caused by: java.lang.OutOfMemoryError: Java heap space
com.google.api.client.googleapis.media.MediaHttpUploader.buildContentChunk(MediaHttpUploader.java:579)
com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:380)
com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:308)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:528)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel$UploadOperation.call(AbstractGoogleAsyncWriteChannel.java:85)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
Configurations:
WorkerType: n2-standard-4
numShards: 10
Writer configuration:
public static Write<String, Evaluation> getAvroWriter(ValueProvider<String> avroFilesPath,
                                                      ValueProvider<Integer> shards) {
    return FileIO.<String, Evaluation>writeDynamic()
        .withNumShards(shards)
        .by(Evaluation::getId)
        .withDestinationCoder(StringUtf8Coder.of())
        .withNaming(Contextful.fn(fileName -> (window, pane, numShards, shardIndex, compression) -> {
            return fileName + ".avro";
        }))
        .via(AvroIO.sink(Evaluation.class))
        .to(avroFilesPath);
}
I took a heap dump for inspection and was surprised by the amount of memory used by stream/byte[].
[Screenshots: heap dump dominator tree, top consumers]
Is something wrong in Apache Beam IO Library with dataflow runner?

pandas.DataFrame.to_parquet() memory overflow in Airflow with Kubernetes Executor

I get OOM errors from Airflow when trying to write a DataFrame to a parquet file.
The DataFrame is quite small, and the available memory is about 5 GB.
df_chunk is a chunk of a pandas.DataFrame with about 2000 rows
df_chunk.memory_usage(deep=True).sum()/1024/1024
yields only about 17MB of memory used
import io
import tempfile

with tempfile.NamedTemporaryFile(delete=True, mode='w+b', buffering=io.DEFAULT_BUFFER_SIZE) as tmp_file:
    print("Writing DF to tmp file")
    df_chunk.to_parquet(tmp_file.name, compression='gzip')
    print("Flushing tmp_file")
    tmp_file.flush()
When I reduce the chunk size to about 500 rows, the issue does not occur.
I tried to run this in a separate process with 2 GB allocated through the resource library. In that case I get a realloc failed error.

HBase MemStore and BlockCache exceeds the threshold

I want to connect Spark jobs to my remote HBase. I have the following jars:
--jars spark-hbase-connector_2.10-1.0.3.jar,hbase-shaded-protobuf-3.4.1.jar,hbase-0.90.3.jar
So I am importing Protobuf, the HBase connector (https://github.com/nerdammer/spark-hbase-connector), and HBase itself. However I get:
java.lang.RuntimeException: Current heap configuration for MemStore
and BlockCache exceeds the threshold required for successful cluster
operation. The combined value cannot exceed 0.8. Please check the
settings for hbase.regionserver.global.memstore.upperLimit and
hfile.block.cache.size in your configuration.
In the hbase-0.90.3.jar, the hbase-default.xml specifies
hbase.regionserver.global.memstore.upperLimit = 0.4
(Maximum size of all memstores in a region server before new updates are blocked and flushes are forced. Defaults to 40% of heap.)
and
hfile.block.cache.size = 0.2
(Percentage of maximum heap (-Xmx setting) to allocate to block cache used by HFile/StoreFile. Default of 0.2 means allocate 20%. Set to 0 to disable.)
0.4 + 0.2 is not more than 0.8. Is there anything else that could cause this error?
In my code I specify
sparkConf.set("spark.hbase.host", "remoteHost")
The ZooKeeper port in hbase-default.xml is 2181, and it is the same in my HBase on the remote server.
Thanks in advance

Spark 2.2 cache() causes the driver OutOfMemoryError

I'm running Spark 2.2 with Scala on AWS EMR (Zeppelin / spark-shell).
I'm trying to run a very simple computation: loading, filtering, caching, and counting a large dataset. My data is 4,500 GB (4.8 TB) in ORC format with 51,317,951,565 (51 billion) rows.
First I tried to process it with the following cluster:
1 master node - m4.xlarge - 4 cpu, 16 gb Mem
150 core nodes - r3.xlarge - 4 cpu, 29 gb Mem
150 tasks nodes - r3.xlarge - 4 cpu, 29 gb Mem
but it failed with an OutOfMemoryError.
When I looked at the Spark UI and Ganglia, I saw that after the application had loaded more than 80% of the data, the driver node became very busy while the executors stopped working (CPU usage was very low), until it crashed.
Ganglia CPU usage for master and worker nodes
Then I tried to execute the same process, only increasing the driver node to:
1 master node - m4.2xlarge - 8 cpu, 31 gb Mem
and it succeeded.
I don't understand why the driver node's memory usage fills up until it crashes. AFAIK only the executors load and process the tasks, and the data should not pass to the master. What could be the reason for this?
1) Ganglia Master Node usage for the second scenario
2) Spark UI stages
3) Spark UI DAG visualization
Below you can find the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession, DataFrame}
import org.apache.spark.sql.functions.{concat_ws, expr, lit, udf}
import org.apache.spark.storage.StorageLevel
val df = spark.sql("select * from default.level_1 where date_ >= ('2017-11-08') and date_ <= ('2017-11-27')")
.drop("carrier", "city", "connection_type", "geo_country", "geo_country","geo_lat","geo_lon","geo_lon","geo_type", "ip","keywords","language","lat","lon","store_category","GEO3","GEO4")
.where("GEO4 is not null")
.withColumn("is_away", lit(0))
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()
Below you can find the error message -
{"Event":"SparkListenerLogStart","Spark Version":"2.2.0"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"driver","Host":"10.44.6.179","Port":44257},"Maximum Memory":6819151872,"Timestamp":1512024674827,"Maximum Onheap Memory":6819151872,"Maximum Offheap Memory":0}
{"Event":"SparkListenerEnvironmentUpdate","JVM Information":{"Java Home":"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.32.amzn1.x86_64/jre","Java Version":"1.8.0_141 (Oracle Corporation)","Scala Version":"version 2.11.8"},"Spark Properties":{"spark.sql.warehouse.dir":"hdfs:///user/spark/warehouse","spark.yarn.dist.files":"file:/etc/spark/conf/hive-site.xml","spark.executor.extraJavaOptions":"-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'","spark.driver.host":"10.44.6.179","spark.history.fs.logDirectory":"hdfs:///var/log/spark/apps","spark.eventLog.enabled":"true","spark.driver.port":"33707","spark.shuffle.service.enabled":"true","spark.driver.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.repl.class.uri":"spark://10.44.6.179:33707/classes","spark.jars":"","spark.yarn.historyServer.address":"ip-10-44-6-179.ec2.internal:18080","spark.stage.attempt.ignoreOnDecommissionFetchFailure":"true","spark.repl.class.outputDir":"/mnt/tmp/spark-52cac1b4-614f-43a5-ab9b-5c60c6c1c5a5/repl-9389c888-603e-4988-9593-86e298d2514a","spark.app.name":"Spark shell","spark.scheduler.mode":"FIFO","spark.driver.memory":"11171M","spark.executor.instances":"200","spark.default.parallelism":"3200","spark.resourceManager.cleanupExpiredHost":"true","spark.executor.id":"driver","spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS":"$(hostname -f)","spark.driver.extraJavaOptions":"-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'","spark.submit.deployMode":"client","spark.master":"yarn","spark.ui.filters":"org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter","spark.blacklist.decommissioning.timeout":"1h","spark.executor.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.sql.hive.metastore.sharedPrefixes":"com.amazonaws.services.dynamodbv2","spark.executor.memory":"20480M","spark.driver.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.home":"/usr/lib/spark","spark.eventLog.dir":"hdfs:///var/log/spark/apps","spark.dynamicAllocation.enabled":"true","spark.executor.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.sql.catalogImplementation":"hive","spark.executor.cores":"8","spark.history.ui.port":"18080","spark.driver.appUIAddress":"http://ip-10-44-6-179.ec2.internal:4040","spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS":"ip-10-44-6-
Notes -
1) I tried changing the StorageLevel to cache() and DISK_ONLY and it didn't affect the result.
2) I checked the volume of the "scratch space" and saw that more than 90% of it is still not in use.
Thanks!!
My assumption is that this may be caused by a mechanism inside Spark SQL.
In short, Spark SQL collects every broadcast dataset on the driver side, so when you have a big query the driver must have enough memory to hold the broadcast data.
Related link to the issue
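(A sketch of that suggestion, not from the original answer; it uses Spark's standard spark.sql.autoBroadcastJoinThreshold property. If driver-side broadcast collection is the culprit, one option besides enlarging the driver is to stop Spark SQL from broadcasting automatically.)
// Sketch only: disable automatic broadcast joins so nothing is collected to the driver
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// alternatively, keep broadcasting but give the driver more memory (e.g. --driver-memory)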

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2. I'm trying to read many Parquet files and filter and aggregate rows - there are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
import org.apache.spark.sql.functions.to_date
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12
  .where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp").toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date").agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark is running in a two-node cluster with 8 cores and 16 GB RAM each. The Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still less than I was expecting, since there are ~30 files that could be processed in parallel:
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
My default config for Spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
There are 200 partitions in the resulting RDD:
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations unless you trigger an action, it can also skip tasks if they are not required for the output. Since first requires only a single element, it can compute only one partition. That is most likely the reason why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
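(A sketch of that last point, not from the original answer; the property names are Spark's standard ones and the values are only illustrative for this two-node cluster.)
// Sketch only: ask for more than the YARN default of 1 core per executor
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.cores", "4")      // concurrent tasks per executor
  .set("spark.executor.instances", "2")  // roughly one executor per node here
Equivalently, spark-shell accepts --executor-cores 4 on the command line.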