Spark 2.2 cache() causes driver OutOfMemoryError - Scala

I'm running Spark 2.2 with Scala on AWS EMR (Zeppelin / spark-shell).
I'm trying to run a very simple computation: loading, filtering, caching, and counting a large data set. My data is 4,500 GB (4.8 TB) in ORC format with 51,317,951,565 (51 billion) rows.
First I tried to process it with the following cluster:
1 master node - m4.xlarge - 4 cpu, 16 gb Mem
150 core nodes - r3.xlarge - 4 cpu, 29 gb Mem
150 tasks nodes - r3.xlarge - 4 cpu, 29 gb Mem
but it failed with OutOfMemoryError.
When I looked at the Spark UI and Ganglia, I saw that after the application had loaded more than 80% of the data, the driver node became very busy while the executors stopped working (CPU usage was very low), until it crashed.
Ganglia CPU usage for master and worker nodes
Then I ran the same process, changing only the driver node to:
1 master node - m4.2xlarge - 8 cpu, 31 gb Mem
and it succeeded.
I don't understand why the driver node's memory usage fills up until it crashes. AFAIK only the executors load and process the tasks, and the data should not pass through the master. What could be the reason for it?
1) Ganglia Master Node usage for the second scenario
2) Spark UI stages
3) Spark UI DAG visualization
Below you can find the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession, DataFrame}
import org.apache.spark.sql.functions.{concat_ws, expr, lit, udf}
import org.apache.spark.storage.StorageLevel
val df = spark.sql("select * from default.level_1 where date_ >= ('2017-11-08') and date_ <= ('2017-11-27')")
.drop("carrier", "city", "connection_type", "geo_country", "geo_country","geo_lat","geo_lon","geo_lon","geo_type", "ip","keywords","language","lat","lon","store_category","GEO3","GEO4")
.where("GEO4 is not null")
.withColumn("is_away", lit(0))
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()
Below you can find the error message -
{"Event":"SparkListenerLogStart","Spark Version":"2.2.0"}
{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"driver","Host":"10.44.6.179","Port":44257},"Maximum Memory":6819151872,"Timestamp":1512024674827,"Maximum Onheap Memory":6819151872,"Maximum Offheap Memory":0}
{"Event":"SparkListenerEnvironmentUpdate","JVM Information":{"Java Home":"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.32.amzn1.x86_64/jre","Java Version":"1.8.0_141 (Oracle Corporation)","Scala Version":"version 2.11.8"},"Spark Properties":{"spark.sql.warehouse.dir":"hdfs:///user/spark/warehouse","spark.yarn.dist.files":"file:/etc/spark/conf/hive-site.xml","spark.executor.extraJavaOptions":"-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'","spark.driver.host":"10.44.6.179","spark.history.fs.logDirectory":"hdfs:///var/log/spark/apps","spark.eventLog.enabled":"true","spark.driver.port":"33707","spark.shuffle.service.enabled":"true","spark.driver.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.repl.class.uri":"spark://10.44.6.179:33707/classes","spark.jars":"","spark.yarn.historyServer.address":"ip-10-44-6-179.ec2.internal:18080","spark.stage.attempt.ignoreOnDecommissionFetchFailure":"true","spark.repl.class.outputDir":"/mnt/tmp/spark-52cac1b4-614f-43a5-ab9b-5c60c6c1c5a5/repl-9389c888-603e-4988-9593-86e298d2514a","spark.app.name":"Spark shell","spark.scheduler.mode":"FIFO","spark.driver.memory":"11171M","spark.executor.instances":"200","spark.default.parallelism":"3200","spark.resourceManager.cleanupExpiredHost":"true","spark.executor.id":"driver","spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS":"$(hostname -f)","spark.driver.extraJavaOptions":"-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 
%p'","spark.submit.deployMode":"client","spark.master":"yarn","spark.ui.filters":"org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter","spark.blacklist.decommissioning.timeout":"1h","spark.executor.extraLibraryPath":"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native","spark.sql.hive.metastore.sharedPrefixes":"com.amazonaws.services.dynamodbv2","spark.executor.memory":"20480M","spark.driver.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.home":"/usr/lib/spark","spark.eventLog.dir":"hdfs:///var/log/spark/apps","spark.dynamicAllocation.enabled":"true","spark.executor.extraClassPath":"/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar","spark.sql.catalogImplementation":"hive","spark.executor.cores":"8","spark.history.ui.port":"18080","spark.driver.appUIAddress":"http://ip-10-44-6-179.ec2.internal:4040","spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS":"ip-10-44-6-
Notes -
1) I tried changing the StorageLevel to cache() and DISK_ONLY, and it didn't affect the result.
2) I checked the volume of the "scratch space" and saw that more than 90% of it was still unused.
Thanks!!

My assumption is that this may be caused by a mechanism inside Spark SQL.
In short, Spark SQL collects every broadcast dataset on the driver side, so when you have a big query the driver must have enough memory to hold the broadcast data.
Related link to the issue
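One way to test this assumption is to look for broadcast operators in the physical plan and, if any show up, switch off automatic broadcast joins. The sketch below is only an illustration: it uses a local session and two made-up tables (`big`, `small`), and the helper `usesBroadcast` is a hypothetical name. `spark.sql.autoBroadcastJoinThreshold` is a standard Spark SQL setting, and -1 disables automatic broadcasting. Note that the query in the question contains no join, so this may not be what fills the driver there.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadcastCheck {
  // Hypothetical helper: true if the physical plan mentions a broadcast operator.
  def usesBroadcast(df: DataFrame): Boolean =
    df.queryExecution.executedPlan.toString.contains("Broadcast")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]") // assumption: local session, for illustration only
      .appName("broadcast-check")
      .getOrCreate()
    import spark.implicits._

    // Made-up data standing in for a large and a small table.
    val big = spark.range(0, 100000).toDF("id")
    val small = Seq((1L, "a"), (2L, "b")).toDF("id", "label")

    // -1 disables automatic broadcast joins, so the small side is no longer
    // collected on the driver and shipped to every executor.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val joined = big.join(small, "id")
    joined.explain() // inspect the plan for Broadcast* operators
    println(s"uses broadcast: ${usesBroadcast(joined)}")
    spark.stop()
  }
}
```

If the plan of the real query shows no broadcast operator at all, the driver pressure has to come from somewhere else.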

Related

Read a huge Oracle table using Spark

Spark version: 2.4.7
OS: Linux RHEL Fedora
I have the following use case:
Read an Oracle table that contains ~150 million records (done daily), and then write these records to 800 files (using repartition) on a shared file system to be used by another application.
I can read the table into a DataFrame, but the write never finishes.
df
res38: org.apache.spark.sql.DataFrame = [ROW_ID: string, CREATED: string ... 276 more fields]
When I limit the number of retrieved records to 1 million and repartition to 6 partitions, the read and write take 2-3 minutes.
I searched for insights on the issue but couldn't figure it out. When running the process on the full data set, I see the following line in the UI logs (it keeps repeating):
INFO sort.UnsafeExternalSorter: Thread 135 spilling sort data of 5.2 GB to disk (57 times so far)
I submit the job using the following:
time spark-submit --verbose --conf spark.dynamicAllocation.enabled=false --conf spark.spark.sql.broadcastTimeout=1000 --conf spark.sql.shuffle.partitions=1500 --conf "spark.ui.enabled=true" --master yarn --driver-memory 60G --executor-memory 10G --num-executors 40 --executor-cores 8 --jars ojdbc6.jar SparkOracleExtractor.jar
Please note that there are enough resources on the cluster (the queue is only 5.0% used), and using Spark is a constraint.
I'd appreciate any help on where to find information to resolve the issue and speed up the process.
This is the code:
val myquery = "select * from mytable"
val dt=20221023
val df = spark.read.format("jdbc").
option("url", s"jdbc:oracle:thin:@$db_connect_string").
option("driver", "oracle.jdbc.driver.OracleDriver").
option("query", myquery).
option("user", db_user).
option("password", db_pass).
option("fetchsize", 10000).
option("delimiter", "|").
load()
df.repartition(800).write.csv(s"file:///fs/extrat_path/${dt}")
These are the shuffle spill sizes after 2.5 hours:
Shuffle Spill (Memory)    Shuffle Spill (Disk)
259.4 GB                  22.0 GB
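The read above specifies no partitioning, so Spark fetches the table through a single JDBC connection into one partition, and `repartition(800)` then has to shuffle all of it, which would be consistent with the spilling. A possible alternative is a range-partitioned JDBC read. The sketch below only builds the option map; `partitionedOptions` is a hypothetical helper, and the table and column names in the usage note are assumptions. The option keys themselves (`dbtable`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC options. The partition column must be numeric, date, or timestamp, and `partitionColumn` cannot be combined with the `query` option, so `dbtable` is used instead.

```scala
object JdbcPartitionOptions {
  // Hypothetical helper: builds reader options for a range-partitioned JDBC read.
  // lower/upper only determine the partition stride; they do not filter rows.
  def partitionedOptions(
      url: String,
      table: String,
      column: String,
      lower: Long,
      upper: Long,
      numPartitions: Int): Map[String, String] = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(lower < upper, "lower bound must be below upper bound")
    Map(
      "url" -> url,
      "driver" -> "oracle.jdbc.driver.OracleDriver",
      "dbtable" -> table,                        // not "query": it cannot be combined with partitionColumn
      "partitionColumn" -> column,               // must be numeric, date, or timestamp
      "lowerBound" -> lower.toString,
      "upperBound" -> upper.toString,
      "numPartitions" -> numPartitions.toString, // also caps concurrent JDBC connections
      "fetchsize" -> "10000"
    )
  }
}
```

Assuming the table had a numeric column (called `NUMERIC_ID` here purely for illustration), usage could look like `spark.read.format("jdbc").options(JdbcPartitionOptions.partitionedOptions(url, "mytable", "NUMERIC_ID", 1L, 150000000L, 800)).option("user", db_user).option("password", db_pass).load()`, after which the write needs no separate `repartition`.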

pandas.DataFrame.to_parquet() memory overflow in Airflow with Kubernetes Executor

I get OOM errors from Airflow when trying to write a DataFrame to a parquet file.
The dataframe is quite small, and the available memory is about 5GB
df_chunk is a chunk of a pandas.DataFrame with about 2000 rows
df_chunk.memory_usage(deep=True).sum()/1024/1024
yields only about 17MB of memory used
with tempfile.NamedTemporaryFile(delete=True, mode='w+b', buffering=io.DEFAULT_BUFFER_SIZE) as tmp_file:
    print("Writing DF to tmp file")
    df_chunk.to_parquet(tmp_file.name, compression='gzip')
    print("Flushing tmp_file")
    tmp_file.flush()
When I reduce the chunk size to about 500 rows, I do not have the issue.
I tried to run this in a separate process with 2GB allocated through the resource library. In that case I get a "realloc failed" error.

Simple spark job fail due to GC overhead limit

I've created a standalone Spark (2.1.1) cluster on my local machines
with 9 cores / 80G on each machine (a total of 27 cores / 240G RAM).
I've got a sample Spark job that sums all the numbers from 1 to x.
This is the code:
package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()
    val sc = spark.sparkContext
    // The whole List is built on the driver before it is distributed
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }

  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds,
so I cranked up the code to sum all the numbers from 1 to 1B (1,000,000,000), and then I get GC overhead limit reached.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10
I'm not a developer and have no knowledge of Scala, but I would like to find a solution to run this code without GC issues.
Per @philantrovert's request, I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master)
spark-defaults contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks
I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (4GB). There is a more efficient way to programmatically create a Dataset/RDD:
val rdd = spark.range(1000000000L).rdd
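To make the suggestion concrete, here is a minimal sketch of the whole sum built on `spark.range`, so the numbers are generated on the executors instead of as one driver-side List. The object name `RangeSum`, the helper `sumTo`, and the `local[*]` master are assumptions for illustration. The sum is kept as a `Long` because the total for 1 to 1,000,000,000 is 500,000,000,500,000,000, which does not fit in an `Int`.

```scala
import org.apache.spark.sql.SparkSession

object RangeSum {
  // Sums 1..n (inclusive). spark.range generates the values per partition on
  // the executors, so no billion-element collection is built on the driver.
  def sumTo(spark: SparkSession, n: Long): Long =
    spark.range(1L, n + 1L) // end is exclusive, hence n + 1
      .rdd
      .map(_.longValue)     // java.lang.Long -> Scala Long
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("RangeSum")
      .master("local[*]") // assumption: replace with the cluster master URL
      .getOrCreate()

    val n = 1000000L
    val sum = sumTo(spark, n)
    // Cross-check against the closed form n * (n + 1) / 2.
    assert(sum == n * (n + 1) / 2)
    println(sum)
    spark.stop()
  }
}
```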

Spark Application - High "Executor Computing Time"

I have a Spark application that has now been running for 46 hours. While the majority of its jobs complete within 25 seconds, specific jobs take hours. Some details are provided below:
Task Time    Shuffle Read        Shuffle Write
7.5 h        2.2 MB / 257402     2.9 MB / 128601
There are other similar task times, of course, with values of 11.3 h, 10.6 h, 9.4 h, etc., each spending the bulk of its time on "rdd at DataFrameFunctions.scala:42". The stage details reveal that the executor's time is spent on "Executor Computing Time". This executor runs on DataNode 1, where CPU utilization is very normal, about 13%. The other boxes (4 more worker nodes) have very nominal CPU utilization.
When the Shuffle Read is within 5000 records, the job is extremely fast and completes within 25 seconds, as stated previously. Nothing is appended to the logs (spark/hadoop/hbase), and nothing appears in /tmp or /var/tmp that would indicate disk-related activity in progress.
I am clueless about what is going wrong. Have been struggling with this for quite some time now. The versions of software used are as follows:
Hadoop : 2.7.2
Zookeeper : 3.4.9
Kafka : 2.11-0.10.1.1
Spark : 2.1.0
HBase : 1.2.6
Phoenix : 4.10.0
Some configurations from the Spark defaults file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.history.fs.logDirectory hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.yarn.jars hdfs://SDCHDPMAST1:8111/user/appuser/spark/share/lib/*.jar
spark.driver.maxResultSize 5G
spark.deploy.zookeeper.url SDCZKPSRV01
spark.executor.memory 12G
spark.driver.memory 10G
spark.executor.heartbeatInterval 60s
spark.network.timeout 300s
Is there any way I can reduce the time spent on "Executor Computing time"?
The data set that these specific jobs process is skewed. Because of the skew, those jobs take much longer than expected.
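A quick way to confirm skew is to count rows per key and look at the heaviest keys. The sketch below is an illustration under assumptions: the column name `key`, the helper `topKeys`, and the toy data are all made up, and the question does not say which column the job groups or joins on.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object SkewCheck {
  // Hypothetical helper: the n most frequent values of keyCol with their row counts.
  def topKeys(df: DataFrame, keyCol: String, n: Int): Array[(String, Long)] =
    df.groupBy(col(keyCol))
      .count()
      .orderBy(desc("count"))
      .limit(n)
      .collect()
      .map(r => (s"${r.get(0)}", r.getLong(1)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]") // assumption: local session, for illustration only
      .appName("skew-check")
      .getOrCreate()
    import spark.implicits._

    // Toy data: one "hot" key dominates, the way a skewed key would.
    val df = (Seq.fill(1000)("hot") ++ Seq("a", "b", "c")).toDF("key")

    topKeys(df, "key", 3).foreach { case (k, c) => println(s"$k -> $c rows") }
    spark.stop()
  }
}
```

If a handful of keys hold most of the rows, the tasks that receive them run far longer than the rest, which would match the hours-long tasks next to 25-second ones.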

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2. I'm trying to read many Parquet files and filter and aggregate rows. There are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12.where("event_data.level >= 90").select(
"pid",
"timestamp",
"event_data.level"
).withColumn("event_date", to_date(logins_12("timestamp"))).drop("timestamp").toDF("pid", "level", "event_date").groupBy("pid", "event_date").agg(Map("level"->"max")).toDF("pid", "event_date", "level")
l_12.first()
My Spark is running on a two-node cluster with 8 cores and 16 GB RAM each. The Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still fewer than I expected, as there are ~30 files that could be processed in parallel.
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
My default config for Spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
There are 200 partitions in the RDD:
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations unless you trigger an action, it can also skip tasks if they are not required for the output. Since first requires only a single element, it may compute only one partition. That is most likely why you see only one running thread at some point.
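The difference between first and count can be observed directly by counting how many partitions each action actually computes. This is a sketch on a plain RDD rather than on the DataFrame from the question; the object name `FirstVsCount`, the helper `partitionsTouched`, and the toy data are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object FirstVsCount {
  // Returns (partitions computed by first(), partitions computed by count()).
  def partitionsTouched(spark: SparkSession, numPartitions: Int): (Long, Long) = {
    val sc = spark.sparkContext
    val touched = sc.longAccumulator("partitions touched")

    val rdd = sc.parallelize(1 to 1000, numPartitions).mapPartitions { it =>
      touched.add(1) // runs once for every partition that is actually computed
      it
    }

    rdd.first() // needs a single element, so it starts with one partition
    val byFirst = touched.sum

    touched.reset()
    rdd.count() // needs every element, so every partition is computed
    val byCount = touched.sum

    (byFirst, byCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]") // assumption: local session, for illustration only
      .appName("first-vs-count")
      .getOrCreate()
    val (byFirst, byCount) = partitionsTouched(spark, 8)
    println(s"first() computed $byFirst partition(s), count() computed $byCount")
    spark.stop()
  }
}
```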
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
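As a rough sketch of what changing those defaults could look like, the snippet below builds a `SparkConf` with explicit executor sizing and computes the resulting upper bound on concurrently running tasks. The helper names and the numbers (2 executors, 7 cores, 12g) are assumptions, not tuned values; the property names are standard Spark settings. In spark-shell the context already exists, so the same values would have to be passed at launch, for example with `--num-executors` and `--executor-cores`.

```scala
import org.apache.spark.SparkConf

object ExecutorSizing {
  // Builds a conf requesting `instances` executors with `cores` task slots each.
  // SparkConf(false) skips loading spark.* system properties, so only the
  // values set here are present.
  def sizedConf(instances: Int, cores: Int, memory: String): SparkConf =
    new SparkConf(false)
      .set("spark.executor.instances", instances.toString)
      .set("spark.executor.cores", cores.toString)
      .set("spark.executor.memory", memory)

  // Upper bound on tasks running at once, assuming one core per task
  // (spark.task.cpus = 1). The fallbacks are the documented YARN defaults.
  def maxConcurrentTasks(conf: SparkConf): Int =
    conf.getInt("spark.executor.instances", 2) * conf.getInt("spark.executor.cores", 1)

  def main(args: Array[String]): Unit = {
    val defaults = new SparkConf(false)
    val sized = sizedConf(instances = 2, cores = 7, memory = "12g") // illustrative numbers
    println(s"defaults: ${maxConcurrentTasks(defaults)} concurrent tasks")
    println(s"sized:    ${maxConcurrentTasks(sized)} concurrent tasks")
  }
}
```

With the defaults this bound is 2, which matches the two running threads seen in the question.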