I've been searching a lot for a solution to the following issue. I'm using Scala 2.11.8 and Spark 2.1.0.
Application application_1489191400413_3294 failed 1 times due to AM Container for appattempt_1489191400413_3294_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-31-17-35.us-west-2.compute.internal:8088/cluster/app/application_1489191400413_3294Then, click on links to logs of each attempt.
Diagnostics: Container [pid=23372,containerID=container_1489191400413_3294_01_000001] is running beyond physical memory limits.
Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Note that I've allotted a lot more than the 1.4 GB being reported in the error here. Since none of my executors are failing, my read of this error was that the driver needs more memory. However, my settings don't seem to be propagating through.
I'm setting job parameters for YARN as follows:
val conf = new SparkConf()
  .setAppName(jobName)
  .set("spark.hadoop.mapred.output.committer.class", "com.company.path.DirectOutputCommitter")
additionalSparkConfSettings.foreach { case (key, value) => conf.set(key, value) }

// this is the implicit that we pass around
implicit val sparkSession = SparkSession
  .builder()
  .appName(jobName)
  .config(conf)
  .getOrCreate()
where the memory provisioning parameters in additionalSparkConfSettings were set with the following snippet:
HashMap[String, String](
  "spark.driver.memory"                -> "8g",
  "spark.executor.memory"              -> "8g",
  "spark.executor.cores"               -> "5",
  "spark.driver.cores"                 -> "2",
  "spark.yarn.maxAppAttempts"          -> "1",
  "spark.yarn.driver.memoryOverhead"   -> "8192",
  "spark.yarn.executor.memoryOverhead" -> "2048"
)
Are my settings really not propagating? Or am I misinterpreting the logs?
Thanks!
Overhead memory needs to be set for both the executor and the driver, and it should be a fraction of the corresponding executor or driver memory:
spark.yarn.executor.memoryOverhead = executorMemory * 0.10, with minimum of 384
The amount of off-heap memory (in megabytes) to be allocated per
executor. This is memory that accounts for things like VM overheads,
interned strings, other native overheads, etc. This tends to grow with
the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead = driverMemory * 0.10, with minimum of 384.
The amount of off-heap memory (in megabytes) to be allocated per
driver in cluster mode. This is memory that accounts for things like
VM overheads, interned strings, other native overheads, etc. This
tends to grow with the container size (typically 6-10%).
To learn more about memory optimizations, please see the Memory Management Overview.
Also see the following SO thread: Container is running beyond memory limits.
Cheers!
The problem in my case was a simple, albeit easy-to-miss, one.
Setting driver-level parameters inside the application code does not work: by the time that code runs, the driver JVM has already started and the configuration is ignored. I confirmed this with a few tests when I solved it months ago.
Executor parameters, however, can be set in code. Just keep parameter precedence in mind if you end up setting the same parameter in different places.
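A minimal sketch of that split, following the observation above (the app name, values, and launch flags here are illustrative, not taken from my actual job):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Executor-level settings can be set in code before the session starts
val conf = new SparkConf()
  .setAppName("my-job") // hypothetical app name
  .set("spark.executor.memory", "8g")
  .set("spark.executor.cores", "5")
  .set("spark.yarn.executor.memoryOverhead", "2048")

// Driver-level settings (spark.driver.memory, spark.yarn.driver.memoryOverhead)
// have to be supplied at launch time instead, e.g.:
//   spark-submit --driver-memory 8g --conf spark.yarn.driver.memoryOverhead=8192 ...
val spark = SparkSession.builder().config(conf).getOrCreate()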
Related
I'm naively testing for concurrency in local mode, with the following Spark session:
SparkSession
  .builder
  .appName("local-mode-spark")
  .master("local[*]")
  .config("spark.executor.instances", 4)
  .config("spark.executor.cores", 2)
  .config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
  .config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
  .getOrCreate()
and a mapPartitions API call as follows:
import spark.implicits._
val inputDF: DataFrame = spark.read.parquet(inputFile)
val resultDF: DataFrame =
  inputDF.as[T].mapPartitions(sparkIterator => new MyIterator).toDF
Admittedly, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like my application to crunch all available machine resources both in production and during this testing, so that the chances of spotting additional concurrency bugs improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency on a local machine? The objective is to verify that in production, Spark will not bump into thread-safety or other concurrency issues in the code it runs from within MyIterator.
Can Spark, in local mode, even process separate partitions of my input dataframe in parallel? Can I get Spark to work concurrently on the same dataframe on a single machine, preferably in local mode?
Max parallelism
You are already running Spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java Virtual Machine (it calls Runtime.getRuntime.availableProcessors() to determine that number).
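A quick way to see what local[*] resolves to on a given machine (a minimal sketch, using only the JDK call mentioned above):

// Number of worker threads local[*] will use on this machine
val cores = Runtime.getRuntime.availableProcessors()
println(s"local[*] will use $cores threads")

// To pin the thread count explicitly, pass a fixed number instead of *:
// .master("local[8]")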
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is quite small (512 MB in older releases, 1 GB in recent ones). If your local machine can spare more than this, set it explicitly. You can do that either by:
setting it in the properties file (the default is spark-defaults.conf):
spark.driver.memory 5g
or by supplying the configuration setting at runtime:
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because by then it is already too late; the process has already started with some fixed amount of memory.
Nature of Job
Check number of partitions in your dataframe. That will essentially determine how much max parallelism you can use.
inputDF.rdd.partitions.size
If the output is 1, your dataframe has only one partition, so you won't get concurrency when you operate on it. In that case, you may have to tweak some configuration, or repartition, to create more partitions so that tasks can run concurrently.
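A small sketch of that check, assuming the inputDF from the question:

// Check how many partitions the dataframe has
println(s"partitions: ${inputDF.rdd.partitions.size}")

// If it is 1 (or very low), spread the data over more partitions so the
// local[*] threads actually get independent tasks to run, e.g.:
val parallelDF = inputDF.repartition(Runtime.getRuntime.availableProcessors())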
Running local mode cannot simulate a production environment for the following reasons.
A lot of code that would normally run under any other cluster manager gets bypassed when running in local mode. Among various issues, a few things I could think of:
a. Inability to detect bugs in the way shuffles are handled (shuffle data is handled in a completely different way in local mode).
b. Serialization-related issues cannot be detected, since all code is available to the driver and tasks run in the driver itself, so nothing ever actually needs to be serialized.
c. No speculative tasks (especially for write operations).
d. Networking-related issues: all tasks execute in the same JVM, so you cannot detect problems such as driver/executor communication issues or codegen-related issues.
Concurrency in local mode
a. The maximum concurrency that can be attained equals the number of cores in your local machine. (Link to code)
b. The Job, Stage, and Task metrics shown in the Spark UI are not accurate, since they incur the overhead of running in the same JVM where the driver is also running.
c. As for CPU/memory utilization, it depends on the operation being performed. Is the operation CPU- or memory-intensive?
When to use local mode
a. Testing of code that will run only on driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl;dr: Concurrency bugs that occur in local mode might not even be present under other cluster resource managers, since there is a lot of special handling for local mode in Spark's code (many places check isLocal and take a completely different code path).
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and CPU available on your local machine and supply values for the driver-memory and driver-cores conf when submitting your Spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open the Spark UI for the job. You can go to the Executors tab to check the amount of resources your Spark job is actually utilizing.
You can monitor the various tasks that get generated, and the number of tasks your job runs concurrently, using the Jobs and Stages tabs.
In order to process data that is far larger than the resources available, make sure you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions in case your job has aggregations or joins, as sketched below. Also, ensure there is sufficient space on the local file system, since Spark creates intermediate shuffle files and writes them to disk.
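A minimal sketch of those two knobs (the path and numbers are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "400") // illustrative; the default is 200
  .getOrCreate()

// Hypothetical input; repartition so the work is spread across all local threads
val df = spark.read.parquet("/path/to/input.parquet")
val repartitioned = df.repartition(64) // illustrative partition count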
Hope this helps!
I'm using PySpark.
Previously I had a similar problem: I collected a lot of data in the driver program and didn't realize that the dataframe's lineage kept the source data in the driver until the action completed. The solution to that problem was to checkpoint the dataframe (truncating the lineage).
But recently I hit the same problem, and this time I don't collect data in the driver but write the result to a DB. Are there other heavy objects that could cause this error?
I checked all the logs and didn't find any warnings or suspicious messages. The last action started was a checkpoint. Of course I can increase the driver memory limit, but the reason for such consumption is still unknown and I'm afraid I have a bottleneck somewhere.
Error message:
Current usage: 4.5 GB of 4.5 GB physical memory used; 6.4 GB of 9.4 GB virtual memory used. Killing container.
And it sends SIGKILL to the executors.
UPDATE:
After some investigation, I suspect the problem may be in BroadcastHashJoin. We have a lot of small tables and broadcast joins are used to join them, but to broadcast a table Spark needs to materialize it in the driver, so I disabled the feature by setting spark.sql.autoBroadcastJoinThreshold=-1. Memory consumption on the driver seems to have decreased, and the AM no longer raises the previous error. However, when writing the result table to the DB (Postgres), the executors get stuck on the last few tasks of the last stage. In an executor's thread dump I see this stuck task:
Executor task launch worker for task 635633 RUNNABLE Lock(java.util.concurrent.ThreadPoolExecutor$Worker#1287601873}), Monitor(java.io.BufferedOutputStream#1935023563}), Monitor(org.postgresql.core.v3.QueryExecutorImpl#625626056})
It's not failing, just waiting. Could this be a lock from Postgres or something else?
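For reference, a minimal sketch of the two driver-side mitigations mentioned above, written in Scala to match the rest of this page (the config key and checkpoint API are the same in PySpark; spark, someDF, and the checkpoint directory are assumed/hypothetical names):

// Disable automatic broadcast joins so small tables are no longer
// collected and materialized on the driver
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Checkpoint to truncate a long lineage so the driver does not keep
// the whole plan/source around
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory
val checkpointed = someDF.checkpoint()                        // someDF stands in for your dataframe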
I think anyone who has used Spark has run across OOM errors, and usually the source of the problem can be found easily. However, I am a bit perplexed by this one. Currently, I am trying to save by two different partition columns, using the partitionBy function. It looks something like below (made-up names):
df.write.partitionBy("account", "markers")
.mode(SaveMode.Overwrite)
.parquet(s"$location$org/$corrId/")
This particular dataframe has around 30 GB of data, 2000 accounts and 30 markers. The accounts and markers are close to evenly distributed. I have tried using 5 core nodes and 1 master node driver of Amazon's r4.8xlarge (220+ GB of memory) with the default maximize-resource-allocation setting (which gives 2x the cores for executors and around 165 GB of memory). I have also explicitly set the number of cores and executors to different values, but had the same issues. When looking at Ganglia, I don't see any excessive memory consumption.
So, it seems very likely that the root cause is the 2gb ByteArrayBuffer issue that can happen on shuffles. I then tried repartitioning the dataframe with various numbers, such as 100, 500, 1000, 3000, 5000, and 10000 with no luck. The job occasionally logs a heap space error, but most of the time gives a node lost error. When looking at the individual node logs, it just seems to suddenly fail with no indication of the problem (which isn't surprising with some oom exceptions).
For dataframe writes, is there a trick with partitionBy to get past the heap space errors?
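One code-level trick that sometimes helps with this kind of write (a sketch reusing the names from the question, not a guaranteed fix): repartition by the same columns you pass to partitionBy, so each task holds rows for only a few (account, markers) combinations instead of opening writers for all of them.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Cluster rows by the partition columns before writing
df.repartition(col("account"), col("markers"))
  .write
  .partitionBy("account", "markers")
  .mode(SaveMode.Overwrite)
  .parquet(s"$location$org/$corrId/")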
I am storing an RDD using storageLevel = MEMORY_ONLY_SER_2, so that if one executor is lost there is another copy of the data.
Then I found something strange:
The "Size in Memory" of the 2x-replicated RDD seems to be the same as when I used storageLevel = MEMORY_ONLY_SER (1x-replicated).
"Fraction Cached" couldn't reach 100% even though I still have a lot of storage memory left.
Am I understanding storageLevel = MEMORY_ONLY_SER_2 correctly? Why doesn't the 2x-replicated RDD have twice the "Size in Memory" of the 1x-replicated one? Thanks!
I guess maybe all of your cache memory is already used, so it doesn't matter how much replication you use.
I don't know how much memory is allocated to each executor; if you allocated a lot, you can increase the value of spark.storage.memoryFraction (the default value is 0.6).
If you just want to verify whether MEMORY_ONLY_SER_2 costs twice as much as MEMORY_ONLY_SER, you can use a small dataset.
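A minimal sketch of that small-dataset check (assuming an existing SparkSession named spark; the data here is made up):

import org.apache.spark.storage.StorageLevel

val sc = spark.sparkContext

val once = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER)
once.count() // materialize the cache

val twice = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER_2)
twice.count()

// Compare the two entries under "Size in Memory" and "Fraction Cached"
// on the Storage tab of the Spark UI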
I have a folder with 150 GB of txt files (around 700 files, each around 200 MB on average).
I'm using Scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
- manually loop through all the files, do the calculations per file, and merge the results at the end
- read the whole folder into one RDD, do all the operations on this single RDD, and let Spark handle all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization-specific code), but I'm wondering whether my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local, between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tuning the settings for this one-workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data); however, I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java heap space error when adding more data (the application breaks with 6 GB of data, but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug memory demands). I've tried increasing 'spark.executor.memory' and using a smaller number of cores (the rationale being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
My team and I successfully processed CSV data of over 1 TB on 5 machines with 32 GB of RAM each. It depends heavily on what kind of processing you're doing and how.
- If you repartition an RDD, it requires additional computation and overhead on top of your heap size. Try loading the file with more parallelism by decreasing the split size via TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE (if you're using TextInputFormat) to raise the level of parallelism.
- Try using mapPartitions instead of map so you can handle the computation inside a partition. If the computation uses a temporary variable or instance and you're still running out of memory, try lowering the amount of data per partition (by increasing the partition count).
- Increase the driver memory and executor memory limits using "spark.executor.memory" and "spark.driver.memory" in the Spark configuration before creating the Spark context. A sketch of these suggestions follows below.
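A minimal sketch of the suggestions above (paths, partition counts, and memory values are illustrative; heap sizes have to be passed at launch, e.g. spark-submit --driver-memory 48g, rather than from inside an already-running driver):

import org.apache.spark.sql.SparkSession

// Fewer threads means more heap per concurrent task on a single workstation
val spark = SparkSession.builder().master("local[8]").getOrCreate()
val sc = spark.sparkContext

// Ask for more input splits up front (an alternative to tweaking SPLIT_MINSIZE/SPLIT_MAXSIZE)
val lines = sc.textFile("/path/to/folder/*.txt", minPartitions = 1200) // hypothetical path and count

// mapPartitions keeps per-partition state out of per-record allocations
val triplets = lines.mapPartitions { it =>
  it.map { line =>
    val fields = line.split('\t')
    (fields(0), fields(1), fields(2))
  }
}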
Note that Spark is a general-purpose cluster computing system, so it's inefficient (IMHO) to use Spark on a single machine.
To add another perspective based on code (as opposed to configuration): sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see whether you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was that I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
    print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, the PySpark RDD/DataFrame collect() function retrieves all the elements of the dataset (from all nodes) to the driver node. We should use collect() only on smaller datasets, usually after filter(), group(), count(), etc. Collecting a larger dataset results in out-of-memory errors.
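The same pattern holds in Scala (a minimal, self-contained sketch; processedData and the output path are stand-ins, not taken from the answers above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val processedData = spark.sparkContext.parallelize(Seq("a", "b", "c")) // stand-in for real results
val outputDir = "/tmp/processed-output"                                // hypothetical output path

// Write results out from the executors instead of collecting them all to the driver
processedData.saveAsTextFile(outputDir)

// If you only need to eyeball a few records on the driver, take a small sample
processedData.take(2).foreach(println)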