Spark Application - High "Executor Computing Time" - scala

I have a Spark application that has now been running for 46 hours. While the majority of its jobs complete within 25 seconds, specific jobs take hours. Some details are provided below:
Task Time    Shuffle Read (size / records)    Shuffle Write (size / records)
7.5 h        2.2 MB / 257402                  2.9 MB / 128601
There are, of course, other similar task times, with values of 11.3 h, 10.6 h, 9.4 h, etc., each spending the bulk of its time on "rdd at DataFrameFunctions.scala:42". The stage details reveal that the time is spent on "Executor Computing Time". This executor runs on DataNode 1, where CPU utilization is unremarkable at about 13%. The other boxes (4 more worker nodes) have very nominal CPU utilization.
When the Shuffle Read is within 5000 records, the job is extremely fast and completes within 25 seconds, as stated previously. Nothing is appended to the logs (spark/hadoop/hbase), and nothing appears under /tmp or /var/tmp that would indicate disk-related activity in progress.
I am clueless about what is going wrong. I have been struggling with this for quite some time now. The versions of software used are as follows:
Hadoop : 2.7.2
Zookeeper : 3.4.9
Kafka : 2.11-0.10.1.1
Spark : 2.1.0
HBase : 1.2.6
Phoenix : 4.10.0
Some settings from the Spark defaults file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.history.fs.logDirectory hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.yarn.jars hdfs://SDCHDPMAST1:8111/user/appuser/spark/share/lib/*.jar
spark.driver.maxResultSize 5G
spark.deploy.zookeeper.url SDCZKPSRV01
spark.executor.memory 12G
spark.driver.memory 10G
spark.executor.heartbeatInterval 60s
spark.network.timeout 300s
Is there any way I can reduce the time spent on "Executor Computing time"?

The dataset this job processes is skewed: a few keys carry far more records than the rest. Because of that skew, the tasks handling those keys take much longer than expected.
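If skew is indeed the cause, one common mitigation is to "salt" the hot keys so that their records spread over several groups, aggregate per salted key, and then merge the partial results. The question does not show what the job computes, so the following is only a minimal sketch on plain Scala collections, assuming the aggregation is an associative sum. `Record`, `saltedPartialSums` and `mergeSalts` are hypothetical names made up for illustration.

```scala
import scala.util.Random

object SaltingSketch {
  // Hypothetical record type: a key plus a numeric value to aggregate.
  final case class Record(key: String, value: Long)

  // Stage 1: attach a random salt in [0, buckets) to each key, so the
  // records of one hot key are split across up to `buckets` groups.
  def saltedPartialSums(records: Seq[Record], buckets: Int, rng: Random): Map[(String, Int), Long] =
    records
      .groupBy(r => (r.key, rng.nextInt(buckets)))
      .map { case (saltedKey, group) => saltedKey -> group.map(_.value).sum }

  // Stage 2: drop the salt and merge the partial sums per original key.
  def mergeSalts(partials: Map[(String, Int), Long]): Map[String, Long] =
    partials
      .groupBy { case ((key, _), _) => key }
      .map { case (key, group) => key -> group.values.sum }

  def main(args: Array[String]): Unit = {
    // One hot key with 1000 records and one cold key with a single record.
    val records = Seq.fill(1000)(Record("hot", 1L)) :+ Record("cold", 5L)
    val partials = saltedPartialSums(records, buckets = 8, rng = new Random(42))
    println(s"salted groups: ${partials.size}")
    println(s"merged totals: ${mergeSalts(partials)}")
  }
}
```

On an RDD the same two stages would presumably map to two aggregations by key, the first on the salted key and the second on the original key. This only works when partial results can be combined, as with sums, counts or maxima.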

Related

Read a huge Oracle table using Spark

Spark version: 2.4.7
OS: Linux RHEL Fedora
I have the following use case:
Read an Oracle table that contains ~150 million records (done daily), and then write these records to 800 files (using repartition) on a shared file system to be used by another application.
I can read the table into a DataFrame, but the write never finishes.
df
res38: org.apache.spark.sql.DataFrame = [ROW_ID: string, CREATED: string ... 276 more fields]
When I limit the number of retrieved records to 1 million and repartition to 6 partitions, the read and write together take 2-3 minutes.
I searched for insights on the issue but couldn't figure it out. When running the process on the full data set, I see the following line repeating in the UI logs:
INFO sort.UnsafeExternalSorter: Thread 135 spilling sort data of 5.2 GB to disk (57 times so far)
I submit the job using the following:
time spark-submit --verbose \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.spark.sql.broadcastTimeout=1000 \
  --conf spark.sql.shuffle.partitions=1500 \
  --conf "spark.ui.enabled=true" \
  --master yarn \
  --driver-memory 60G \
  --executor-memory 10G \
  --num-executors 40 \
  --executor-cores 8 \
  --jars ojdbc6.jar \
  SparkOracleExtractor.jar
Please note there are enough resources on the cluster and the queue is only 5.0% used, and a constraint is to use Spark.
Appreciate any help on where to get information to resolve the issue and speed up the process.
This is the code:
val myquery = "select * from mytable"
val dt = 20221023
// Read the query result over JDBC; db_connect_string, db_user and db_pass are defined elsewhere.
val df = spark.read.format("jdbc").
  option("url", s"jdbc:oracle:thin:@$db_connect_string").
  option("driver", "oracle.jdbc.driver.OracleDriver").
  option("query", myquery).
  option("user", db_user).
  option("password", db_pass).
  option("fetchsize", 10000).
  option("delimiter", "|").
  load()
// Shuffle into 800 partitions and write one CSV file per partition.
df.repartition(800).write.csv(s"file:///fs/extrat_path/${dt}")
These are the shuffle spill sizes after 2.5 hours:
Shuffle Spill (Memory)    Shuffle Spill (Disk)
259.4 GB                  22.0 GB
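The repeating UnsafeExternalSorter spill lines are consistent with the shuffle triggered by repartition(800). Without partitioning options, a Spark JDBC read yields a single partition, so one task fetches all ~150 million rows and then feeds the whole shuffle. One way to avoid that, sketched below under assumptions, is to let Spark read the table in 800 slices from the start by passing one predicate per partition. The helper `hashPredicates` is a hypothetical name, and the sketch assumes ROW_ID is never NULL and that the database can serve many concurrent connections.

```scala
object JdbcPredicateSketch {
  // Builds one WHERE-clause fragment per partition. Oracle's
  // ORA_HASH(expr, maxBucket) returns a bucket number in 0..maxBucket,
  // so each row matches exactly one of the generated predicates.
  def hashPredicates(column: String, numPartitions: Int): Array[String] = {
    require(numPartitions > 0, "numPartitions must be positive")
    (0 until numPartitions)
      .map(i => s"ORA_HASH($column, ${numPartitions - 1}) = $i")
      .toArray
  }

  def main(args: Array[String]): Unit = {
    val predicates = hashPredicates("ROW_ID", 800)
    println(predicates.head)
    println(predicates.length)
    // Assumed usage inside the Spark job (url and connectionProperties
    // are placeholders that the job would have to define):
    //   val df = spark.read.jdbc(url, "mytable", predicates, connectionProperties)
    //   df.write.csv(...)   // 800 partitions already, no repartition needed
  }
}
```

With this kind of read, each predicate becomes one partition, so the write should produce 800 files without a shuffle. The number of simultaneous Oracle connections is bounded by the number of concurrently running tasks, which is worth checking with the DBA before trying it.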

HBase MemStore and BlockCache exceeds the threshold

I want to connect spark jobs with my remote HBase. I got the following jars:
--jars spark-hbase-connector_2.10-1.0.3.jar,hbase-shaded-protobuf-3.4.1.jar,hbase-0.90.3.jar
So I am importing protobuf, the connector to HBase (https://github.com/nerdammer/spark-hbase-connector), and HBase itself. However, I get:
java.lang.RuntimeException: Current heap configuration for MemStore
and BlockCache exceeds the threshold required for successful cluster
operation. The combined value cannot exceed 0.8. Please check the
settings for hbase.regionserver.global.memstore.upperLimit and
hfile.block.cache.size in your configuration.
In the hbase-0.90.3.jar, the hbase-default.xml specifies
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
  <description>Maximum size of all memstores in a region server before new
  updates are blocked and flushes are forced. Defaults to 40% of heap</description>
</property>
and
<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
  <description>Percentage of maximum heap (-Xmx setting) to allocate to block cache
  used by HFile/StoreFile. Default of 0.2 means allocate 20%.
  Set to 0 to disable.</description>
</property>
0.4+0.2 isn't more than 0.8. Is there anything that can cause this error?
In my code I specify
sparkConf.set("spark.hbase.host", "remoteHost")
The port to zookeeper in hbase-defaults is 2181 and it is the same in my hbase on remote server.
Thanks in advance
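The rule quoted in the error message can be checked by hand. The sketch below is not HBase's actual code, only a restatement of the message (the two fractions together may not exceed 0.8), and `exceedsThreshold` is a hypothetical name. It confirms that the defaults shown above pass, which suggests the values in effect at runtime are not the ones in this jar's hbase-default.xml. One possible explanation, not confirmed here, is that another hbase-default.xml or hbase-site.xml on the classpath takes precedence.

```scala
object HeapFractionCheck {
  // Limit stated in the error message.
  val MaxCombined = 0.8

  // True when the two heap fractions together exceed the stated limit.
  def exceedsThreshold(memstoreUpperLimit: Double, blockCacheSize: Double): Boolean =
    memstoreUpperLimit + blockCacheSize > MaxCombined

  def main(args: Array[String]): Unit = {
    // Values from the hbase-default.xml quoted in the question.
    println(exceedsThreshold(0.4, 0.2))
    // A hypothetical combination that would trigger the error.
    println(exceedsThreshold(0.4, 0.5))
  }
}
```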

Spark 2.2 cache() causes a driver OutOfMemoryError

I'm running Spark 2.2 with Scala on AWS EMR (Zeppelin / spark-shell).
I'm trying to run a very simple computation: loading, filtering, caching and counting a large data set. My data is 4,500 GB (4.8 TB) in ORC format with 51,317,951,565 (51 billion) rows.
First I tried to process it with the following cluster:
1 master node - m4.xlarge - 4 cpu, 16 gb Mem
150 core nodes - r3.xlarge - 4 cpu, 29 gb Mem
150 tasks nodes - r3.xlarge - 4 cpu, 29 gb Mem
but it failed with OutOfMemoryError.
When I looked at the Spark UI and Ganglia, I saw that after the application had loaded more than 80% of the data, the driver node became very busy while the executors stopped working (CPU usage was very low), until it crashed.
Ganglia CPU usage for master and worker nodes
Then I tried to run the same process, changing only the driver node to:
1 master node - m4.2xlarge - 8 cpu, 31 gb Mem
and it succeeded.
I don't understand why the driver node's memory fills up until it crashes. AFAIK only the executors load and process the tasks, and the data should not pass through the master. What could be the reason for it?
1) Ganglia Master Node usage for the second scenario
2) Spark UI stages
3) Spark UI DAG visualization
Below you can find the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession, DataFrame}
import org.apache.spark.sql.functions.{concat_ws, expr, lit, udf}
import org.apache.spark.storage.StorageLevel
val df = spark.sql("select * from default.level_1 where date_ >= ('2017-11-08') and date_ <= ('2017-11-27')")
.drop("carrier", "city", "connection_type", "geo_country", "geo_country","geo_lat","geo_lon","geo_lon","geo_type", "ip","keywords","language","lat","lon","store_category","GEO3","GEO4")
.where("GEO4 is not null")
.withColumn("is_away", lit(0))
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()
Below you can find the error message -
{"Event":"SparkListenerLogStart","Spark Version":"2.2.0"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"driver","Host":"10.44.6.179","Port":44257},"Maximum Memory":6819151872,"Timestamp":1512024674827,"Maximum Onheap Memory":6819151872,"Maximum Offheap Memory":0}
{"Event":"SparkListenerEnvironmentUpdate","JVM Information":{"Java Home":"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.32.amzn1.x86_64/jre","Java Version":"1.8.0_141 (Oracle Corporation)","Scala Version":"version 2.11.8"},"Spark Properties":{"spark.sql.warehouse.dir":"hdfs:///user/spark/warehouse","spark.yarn.dist.files":"file:/etc/spark/conf/hive-site.xml","spark.executor.extraJavaOptions":"-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'","spark.driver.host":"10.44.6.179","spark.history.fs.logDirectory":"hdfs:///var/log/spark/apps","spark.eventLog.enabled":"true","spark.driver.port":"33707","spark.shuffle.service.enabled":"true","spark.driver.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.repl.class.uri":"spark://10.44.6.179:33707/classes","spark.jars":"","spark.yarn.historyServer.address":"ip-10-44-6-179.ec2.internal:18080","spark.stage.attempt.ignoreOnDecommissionFetchFailure":"true","spark.repl.class.outputDir":"/mnt/tmp/spark-52cac1b4-614f-43a5-ab9b-5c60c6c1c5a5/repl-9389c888-603e-4988-9593-86e298d2514a","spark.app.name":"Spark shell","spark.scheduler.mode":"FIFO","spark.driver.memory":"11171M","spark.executor.instances":"200","spark.default.parallelism":"3200","spark.resourceManager.cleanupExpiredHost":"true","spark.executor.id":"driver","spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS":"$(hostname -f)","spark.driver.extraJavaOptions":"-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 
%p'","spark.submit.deployMode":"client","spark.master":"yarn","spark.ui.filters":"org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter","spark.blacklist.decommissioning.timeout":"1h","spark.executor.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.sql.hive.metastore.sharedPrefixes":"com.amazonaws.services.dynamodbv2","spark.executor.memory":"20480M","spark.driver.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.home":"/usr/lib/spark","spark.eventLog.dir":"hdfs:///var/log/spark/apps","spark.dynamicAllocation.enabled":"true","spark.executor.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.sql.catalogImplementation":"hive","spark.executor.cores":"8","spark.history.ui.port":"18080","spark.driver.appUIAddress":"http://ip-10-44-6-179.ec2.internal:4040","spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS":"ip-10-44-6-
Notes -
1) I tried changing the StorageLevel to cache() and DISK_ONLY and it didn't affect the result.
2) I checked the volume of the "scratch space" and saw that more than 90% of it is still unused.
Thanks!!
I suspect this may be caused by a mechanism inside Spark SQL.
In short, Spark SQL collects every broadcast dataset on the driver side, so for a big query the driver must have enough memory to hold the broadcast data.
Related link to the issue

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2. I'm trying to read many Parquet files, then filter and aggregate rows. There are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12.where("event_data.level >= 90").select(
"pid",
"timestamp",
"event_data.level"
).withColumn("event_date", to_date(logins_12("timestamp"))).drop("timestamp").toDF("pid", "level", "event_date").groupBy("pid", "event_date").agg(Map("level"->"max")).toDF("pid", "event_date", "level")
l_12.first()
My Spark is running on a two-node cluster with 8 cores and 16 GB RAM each. The Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still fewer than I was expecting, as there are ~30 files that can be processed in parallel:
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
there are 200 partitions of the rdd
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations until you trigger an action, it can also skip tasks that are not required for the output. Since first requires only a single element, it may compute only one partition. That is most likely why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN defaults to 1. Two workers give two running threads.
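To put rough numbers on this, here is a small back-of-the-envelope sketch. The names `slots` and `waves` are made up for illustration, and the 8-cores-per-executor case assumes YARN would actually grant every core on each node to Spark, which depends on the cluster setup. The number of tasks that can run at once is executors times cores per executor, and a stage with N tasks needs roughly N divided by that many "waves".

```scala
object TaskSlotsSketch {
  // Tasks that can run concurrently across the whole application.
  def slots(numExecutors: Int, coresPerExecutor: Int): Int =
    numExecutors * coresPerExecutor

  // Scheduling rounds needed to run numTasks on the given slots (ceiling division).
  def waves(numTasks: Int, slots: Int): Int =
    (numTasks + slots - 1) / slots

  def main(args: Array[String]): Unit = {
    // 368 input partitions, as reported by logins_12.rdd.partitions.size.
    val tasks = 368
    // Two executors with the YARN default of one core each.
    println(waves(tasks, slots(numExecutors = 2, coresPerExecutor = 1)))
    // Two executors with eight cores each (assumed to be available).
    println(waves(tasks, slots(numExecutors = 2, coresPerExecutor = 8)))
  }
}
```

On YARN the cores per executor and the executor count are normally set when launching the shell, for example with the --executor-cores and --num-executors options.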

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps:
1) Read a big (1TB) sequence file (corresponding to 1 day of data)
2) Filter out most of it and get about 1GB of shuffle write
3) keyBy customer
4) aggregateByKey() to a custom structure that builds a profile for that customer, corresponding to a HashMap[Long, Float] per customer. The Long keys are unique and never bigger than 50K distinct entries.
I'm running this with this configuration:
--name geo-extract-$1-askTimeout \
--executor-cores 8 \
--num-executors 100 \
--executor-memory 40g \
--driver-memory 4g \
--driver-cores 8 \
--conf 'spark.storage.memoryFraction=0.25' \
--conf 'spark.shuffle.memoryFraction=0.35' \
--conf 'spark.kryoserializer.buffer.max.mb=1024' \
--conf 'spark.akka.frameSize=1024' \
--conf 'spark.akka.timeout=200' \
--conf 'spark.akka.askTimeout=111' \
--master yarn-cluster \
And getting this error:
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
...
Caused by: org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(0)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
... 21 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
The job and the logic have been shown to work with a small test set, and I can even run this job for some dates but not for others. I've googled around and found hints that "Error communicating with MapOutputTracker" is related to internal Spark messages, but I have already increased "spark.akka.frameSize", "spark.akka.timeout" and "spark.akka.askTimeout" (this last one does not even appear in the Spark documentation, but was mentioned on the Spark mailing list), to no avail. There is still some timeout firing at 30 seconds that I have no clue how to identify or fix.
I see no reason for this to fail due to data size, as the filtering operation and the fact that aggregateByKey performs local partial aggregations should be enough to address the data size. The number of tasks is 16K (automatic from the original input), much more than the 800 cores that are running this, on 100 executors, so it is not as simple as the usual "increment partitions" tip. Any clues would be greatly appreciated! Thanks!
I had a similar issue: my job would work fine with a smaller dataset but fail with larger ones.
After a lot of configuration changes, I found that changing the driver memory settings has much more of an impact than changing the executor memory settings.
Using the new garbage collector also helps a lot. I am using the following configuration for a cluster of 3 nodes with 40 cores each. Hope the following config helps:
spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.driver.memory=8g
spark.driver.cores=10
spark.driver.maxResultSize=8g
spark.executor.memory=16g
spark.executor.cores=25
spark.default.parallelism=50
spark.eventLog.dir=hdfs://mars02-db01/opt/spark/logs
spark.eventLog.enabled=true
spark.kryoserializer.buffer=512m
spark.kryoserializer.buffer.max=1536m
spark.rdd.compress=true
spark.storage.memoryFraction=0.15
spark.storage.MemoryStore=12g
What's going on in the driver at the time of this failure? It could be that memory pressure on the driver is making it unresponsive. If I recall correctly, the MapOutputTracker that it's trying to reach when it sends GetMapOutputStatuses runs in the Spark driver process.
If you're facing long GCs or other pauses in that process for some reason, this would cause the exceptions you're seeing above.
One thing to try is running jstack against the driver process when you start seeing these errors and looking at what it is doing. If jstack doesn't respond, it could be that your driver isn't sufficiently responsive.
16K tasks does sound like a lot for the driver to keep track of. Any chance you can increase the driver memory past 4g?
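To check the long-GC theory without external tools, the JVM's standard management beans report cumulative garbage collection counts and times for the process they run in. The following is only a sketch: `gcTotals` is a hypothetical helper, and it assumes you can execute a few lines inside the driver JVM (for instance from the driver's own code) and compare two readings taken some time apart.

```scala
import java.lang.management.ManagementFactory

object GcPauseSketch {
  // Returns (total collections, total collection time in ms) summed
  // over all garbage collectors of the JVM this code runs in.
  def gcTotals(): (Long, Long) = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans
    (0 until beans.size)
      .map(i => beans.get(i))
      .foldLeft((0L, 0L)) { case ((count, millis), bean) =>
        // Both getters return -1 when a collector does not report the figure.
        (count + math.max(bean.getCollectionCount, 0L),
         millis + math.max(bean.getCollectionTime, 0L))
      }
  }

  def main(args: Array[String]): Unit = {
    val (count, millis) = gcTotals()
    println(s"collections so far: $count, time spent in GC: $millis ms")
  }
}
```

If the time spent in GC grows by many seconds between two readings taken shortly before the timeouts appear, that would support the idea that the driver is pausing for longer than the 30 second ask timeout.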
Try setting the following property:
spark.shuffle.reduceLocality.enabled=false
See this issue for details:
https://issues.apache.org/jira/browse/SPARK-13631