Spark Streaming does not clean files in /tmp - scala

I have a Spark Streaming job (2.3.1 - standalone cluster):
~50K RPS
10 executors (2 cores, 7.5 GB RAM, 10 GB disk GCP nodes)
Data rate is ~20Mb/sec, and each 1s batch completes in ~0.5s while the job is running.
The problem is that the temporary files that Spark writes to /tmp are never cleaned out, as the JVM on the executor never terminates. Short of some clever batch job to clean the /tmp directory, I am looking for the proper way to keep my job from crashing (and restarting) on "no space left on device" errors.
I have the following SPARK_WORKER_OPTS set as follows:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1200 -Dspark.worker.cleanup.appDataTtl=345600"
I have experimented with both the CMS and G1GC collectors - neither seemed to have an impact other than modulating GC time.
I have been through most of the documented settings and searched around, but have not been able to find any additional directions to try. I have to believe that people are running much bigger jobs with a long-running, stable stream and that this is a solved problem.
Other config bits:
spark-defaults:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.broadcast.compress true
spark.streaming.stopGracefullyOnShutdown true
spark.ui.reverseProxy true
spark.cleaner.periodicGC.interval 20
...
spark-submit: (nothing special)
/opt/spark/bin/spark-submit \
--master spark://6.7.8.9:6066 \
--deploy-mode cluster \
--supervise \
/opt/application/switch.jar
As it stands the job runs for ~90 minutes before the drives fill up and we crash. I could spin the nodes up with larger drives, but 90 minutes should be enough for the cleanup options I've tried, which run at 20-minute intervals, to kick in. Larger drives would likely just prolong the issue.

Drives filling up turned out to be a red herring. A cached Dataset inside the job was not being released from memory after ds.cache(). Eventually things spilled to disk and our "no space left on device" condition resulted. In other words, bitten by premature optimization. Removing the explicit call to cache resolved the issue and performance was not impacted.
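For anyone hitting the same pattern, here is a minimal sketch (the method name, sink and paths are mine, not from the original job) of how a per-batch cache can be released so blocks don't accumulate under spark.local.dir (/tmp by default):

import org.apache.spark.sql.Dataset

def processBatch[T](batch: Dataset[T], outputPath: String): Unit = {
  // Only worth caching if the batch is reused by more than one action.
  val cached = batch.cache()
  try {
    cached.write.mode("append").parquet(outputPath)      // hypothetical sink
    println(s"wrote ${cached.count()} rows this batch")  // second action that reuses the cache
  } finally {
    cached.unpersist()  // release the blocks so they don't pile up and spill to local disk
  }
}

If there is only a single action per batch, skipping cache() entirely (as in the fix above) is the simpler option.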

Related

How to pick the number of executors, cores for each executor and executor memory

10 Node cluster, each machine has 16 cores and 126.04 GB of RAM
The application's input dataset is around 1 TB with 10-15 files, and there is some aggregation (groupBy).
The job will run using YARN as the resource scheduler.
My question: how do I pick num-executors, executor-memory, executor-cores, driver-memory and driver-cores?
I tend to use this tool - http://spark-configuration.luminousmen.com/ - for profiling my Spark jobs. The process does take some trial and error, but it helps in the long run.
Additionally, you can read up on how Spark memory works: https://luminousmen.com/post/dive-into-spark-memory
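As a rough starting point (my own rule-of-thumb numbers, not from the linked tool): leave about 1 core and a few GB per node for the OS and Hadoop daemons, cap executors at ~5 cores each, and budget for the YARN memory overhead (max(384 MB, ~10%)). On 10 nodes with 16 cores / 126 GB each that gives roughly 3 executors per node (15 usable cores / 5), i.e. 30 executors minus 1 for the ApplicationMaster, and about (120 GB / 3) minus overhead ≈ 36 GB per executor:

spark-submit \
  --master yarn --deploy-mode cluster \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 36g \
  --driver-cores 5 \
  --driver-memory 36g \
  your-application.jar

(your-application.jar is a placeholder.) Treat these as a first guess and refine from the Spark UI; with a 1 TB input and a groupBy, shuffle spill and GC time are the main things to watch.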

K8s cluster memory decreases when running an Apache Flink Job

We are trying to deploy an Apache Flink job on a K8s cluster, but we are noticing odd behavior: when we start our job, the task manager memory starts at the amount assigned, in our case 3 GB.
taskmanager.memory.process.size: 3g
Eventually the memory starts decreasing until it reaches about 160 MB; at that point it recovers a little memory, so it doesn't run out completely.
That very low memory often causes the job to be terminated due to a task manager heartbeat exception, even when we're just trying to watch the logs on the Flink dashboard or while the job is doing its processing.
Why is it going so low on memory? We expected this behavior, but in the range of GB, because we assigned those 3 GB to the task manager. Even if we change the task manager memory size, we see the same behavior.
Our Flink conf looks like this:
flink-conf.yaml: |+
taskmanager.numberOfTaskSlots: 1
blob.server.port: 6124
taskmanager.rpc.port: 6122
taskmanager.memory.process.size: 3g
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9999
metrics.system-resource: true
metrics.system-resource-probing-interval: 5000
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123
Is there a recommended configuration on K8s for memory, or something that we are missing in our flink-conf.yaml?
Thanks.
Your configuration looks fine. It's most likely an issue with your code and some kind of memory leak. This is a very good answer describing what may be the problem.
You can try setting a limit on the JVM heap with taskmanager.memory.task.heap.size so that the JVM has some extra room to do GC, etc. But in the end, if your code keeps references to objects it no longer needs, you will still run into the same situation.
Presumably you are using your memory to store your state, in which case you can also try RocksDB as a state backend, in case you are storing large objects.
What are your requests/limits in your deployment templates? If no request sizes are specified, you may be seeing your cluster's resources get eaten up.
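To make the last two suggestions concrete, here is a sketch with illustrative values only (the bucket path and sizes are placeholders, not recommendations): extra flink-conf.yaml lines for an explicit task-heap cap and RocksDB, plus requests/limits on the task manager container.

flink-conf.yaml additions:
  taskmanager.memory.task.heap.size: 1536m   # explicit cap so the rest of the 3g budget stays free for framework/GC
  state.backend: rocksdb                     # keeps large state off the JVM heap
  state.checkpoints.dir: s3://my-bucket/flink-checkpoints   # placeholder location

Task manager container in the deployment template:
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "3Gi"
      cpu: "2"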

Zombie giant unkillable task blocks Druid at restart

I'm running Apache Druid 0.17, deployed with nohup ./bin/start-nano-quickstart > mylog.log. As the deep storage I am using S3, and I have the Parquet extension enabled; everything works fine. I could correctly ingest several small, Spark-partitioned Parquet datasources from S3. All the remaining configuration is untouched.
When I tried loading a giant datasource to test performance and resource usage, the task died after a couple of hours because of an OutOfMemoryError (which was expected):
2020-02-07T17:32:20,519 INFO [task-runner-0-priority-0] org.apache.druid.segment.realtime.appenderator.BaseAppenderatorDriver - New segment[arc_2016-09-29T12:00:00.000Z_2016-09-29T13:00:00.000Z_2020-02-07T17:22:45.965Z] for sequenceName[index_parallel_arc_chgindko_2020-02-07T14:59:32.337Z].
Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded
Now every time I restart Druid, it starts that giant task again and it is impossible to kill it. Even when the task apparently dies or turns to waiting status, CPU usage is about 140% and I cannot submit new tasks to Druid. I tried to access the Derby database manually to find the task and remove it, but I was not successful, and that solution is really nasty. I know that I can change the database in the configuration so that next time I will have a fresh Druid, but that is not a good solution either, as I will lose all my other datasources. How can I get rid of this long-running zombie task?
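One thing that may be worth trying before swapping out the metadata store (this is my suggestion, not something covered above): the Overlord exposes a task shutdown endpoint, and the task rows live in the druid_tasks table of the metadata storage. Assuming the quickstart's combined Coordinator-Overlord on port 8081:

# list tasks to find the exact task id
curl "http://localhost:8081/druid/indexer/v1/tasks"
# ask the Overlord to shut the task down
curl -X POST "http://localhost:8081/druid/indexer/v1/task/<taskId>/shutdown"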

Spark is not using all configured memory

Starting Spark in standalone client mode on a 10-node cluster using Spark 2.1.0-SNAPSHOT.
9 nodes are workers, the 10th is the master and driver. Each has 256 GB of memory.
I'm having difficulty utilizing my cluster fully.
I'm setting the memory limit for the executors and the driver to 200 GB using the following parameters to spark-shell:
spark-shell --executor-memory 200g --driver-memory 200g --conf spark.driver.maxResultSize=200g
When my application starts I can see those values set as expected both in console and in spark web UI /environment/ tab.
But when I go to the /executors/ tab, I see that my nodes got only 114.3 GB of storage memory assigned; see the screenshot below.
The total memory shown there is 1.1 TB, while I would expect 2 TB. I double-checked that other processes were not using the memory.
Any idea what the source of that discrepancy is? Did I miss some setting? Is it a bug in the /executors/ tab or in the Spark engine?
You're fully utilizing the memory, but here you are only looking at the storage portion of the memory. By default, the storage portion is 60% of the total memory.
From Spark Docs
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster.
As of Spark 1.6, the execution memory and the storage memory are shared, so it's unlikely that you would need to tune the spark.memory.fraction parameter.
If you're using YARN, the "Memory Used" and "Memory Total" figures on the Resource Manager's main page will show the total memory usage.
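For what it's worth, the 114.3 GB figure lines up with Spark's unified-memory formula under the defaults (my back-of-the-envelope check, assuming spark.memory.fraction = 0.6 and the 300 MB reserved memory):

memory shown in the /executors/ tab ≈ (JVM max heap − 300 MB) × spark.memory.fraction
                                    ≈ (~190 GB − 0.3 GB) × 0.6 ≈ 114 GB

The JVM reports a max heap a bit below -Xmx200g (one survivor space is excluded), which is why the result lands under the naive 200 GB × 0.6 = 120 GB.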

Spark: long delay between jobs

So we are running a Spark job that extracts data, does some expensive data conversion and writes to several different files. Everything is running fine, but I'm getting random, lengthy delays between a resource-intensive job finishing and the next job starting.
In the picture below, we can see that the job scheduled at 17:22:02 took 15 min to finish, which means I'd expect the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, which is +4 hours after the job succeeded.
When I dig into the next job's Spark UI, it shows <1 sec of scheduler delay. So I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on; how I/O ops are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, as queries are still running even with all jobs being successful, but you cannot dive into it at all.
I'm sure there are more ways to improve, but the two methods below were sufficient for me (a small sketch of the second one follows the list):
reduce file count
set parquet.enable.summary-metadata to false
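A sketch of how the second setting can be applied (assuming a spark-shell style session where sc is the SparkContext; the property is a Hadoop/Parquet one, so it goes on the Hadoop configuration):

// Stop the committer from generating Parquet summary files (_metadata, _common_metadata),
// which otherwise requires the driver to read every output file's footer after the write.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

The same thing can be passed at submit time with --conf spark.hadoop.parquet.enable.summary-metadata=false.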
I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occur on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing Parquet data to S3 with PySpark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would neither release resources for the application nor mark it as complete. On checking the S3 bucket, it seemed like the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except for the driver node.
Solution:
The solution that worked for me was to use the committer class from AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the Spark job on EMR 5.20.0 using the above solution, it did not create any _temporary directory and all the files were written directly to the output bucket, so the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html