I'm starting Spark in standalone client mode on a 10-node cluster using Spark 2.1.0-SNAPSHOT.
9 nodes are workers; the 10th is the master and driver. Each node has 256 GB of memory.
I'm having difficulty utilizing my cluster fully.
I'm setting the memory limit for the executors and the driver to 200 GB using the following parameters to spark-shell:
spark-shell --executor-memory 200g --driver-memory 200g --conf spark.driver.maxResultSize=200g
When my application starts, I can see those values set as expected, both in the console and in the Spark web UI's /environment/ tab.
But when I go to the /executors/ tab, I see that my nodes were assigned only 114.3 GB of storage memory each; see the screenshot below.
The total memory shown there is 1.1 TB, while I would expect 2 TB. I double-checked that no other processes were using the memory.
Any idea what the source of that discrepancy is? Did I miss some setting? Is it a bug in the /executors/ tab or in the Spark engine?
You are fully utilizing the memory, but in the /executors/ tab you are only looking at the storage portion of it. By default, that portion is 60% of the total memory.
From the Spark docs:
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster.
As of Spark 1.6, execution memory and storage memory are shared, so it's unlikely that you would need to tune the spark.memory.fraction parameter.
If you're using YARN, the "Memory Used" and "Memory Total" figures on the ResourceManager's main page show the total memory usage.
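As a rough sketch of where a figure like 114.3 GB comes from, assuming the Spark 2.x defaults (spark.memory.fraction = 0.6 and a 300 MB reserved region), and noting that the JVM reports somewhat less usable heap than the configured -Xmx:

# the "Storage Memory" column in /executors/ shows the unified (storage + execution) region:
#   maxMemory ≈ (usable JVM heap - 300 MB) * spark.memory.fraction
# with --executor-memory 200g the JVM reports roughly 190 GB of usable heap, so
#   (190 GB - 0.3 GB) * 0.6 ≈ 114 GB per executor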
A related question:
I have a Spark job running on an EMR cluster with the following configuration:
Master: 1 × m4.2xlarge (32 GiB of memory, 8 vCPUs)
Core: 2 × m4.2xlarge (32 GiB of memory, 8 vCPUs)
Task nodes: up to 52 × r4.2xlarge (61 GiB of memory, 8 vCPUs)
Here is my spark-submit configuration, based on this blog post: https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
spark.executor.memory=19g
spark.executor.cores=3
spark.yarn.driver.memoryOverhead=2g
spark.executor.memoryOverhead=2g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=7
spark.dynamicAllocation.initialExecutors=7
spark.dynamicAllocation.maxExecutors=1000
spark.shuffle.service.enabled=true
spark.yarn.maxAppAttempts=1
I am running a cross join of two datasets for a use case, and with the settings above I am trying to utilize every bit of memory and CPU available on the cluster. I am able to utilize all the memory in the cluster, but not the CPU: even though 432 cores are available, the Spark job uses only 103 of them, as shown in the screenshot. I see the same behaviour whether the job runs in yarn-client mode (Zeppelin) or yarn-cluster mode.
I am not sure which setting is missing or incorrect. Any suggestions to resolve this are appreciated.
If you are seeing this in the YARN UI, you probably have to add this in yarn-site.xml:
yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
I had the same confusion. With DefaultResourceCalculator, the YARN UI only accounts for memory usage; behind the scenes a container may be using more than one core, but you will see only one core reported. DominantResourceCalculator, on the other hand, takes both cores and memory into account for resource allocation and shows the actual number of cores and amount of memory.
You can also enable Ganglia or check the EMR metrics for more details.
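For reference, a sketch of that setting in XML form (note that on some Hadoop distributions it lives in capacity-scheduler.xml rather than yarn-site.xml):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>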
Total noob here: I installed Cloudera Manager on a single node on AWS EC2. I followed the install wizard, but when I try running spark-shell or pyspark I get the following error message:
ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024+384
MB) is above the max threshold (1024 MB) of this cluster! Please check
the values of 'yarn.scheduler.maximum-allocation-mb' and/or
'yarn.nodemanager.resource.memory-mb'.
Can somebody explain what is going on, or where to begin reading? Total noob here, so any help or direction is greatly appreciated.
The required executor memory (1024 + 384 MB) is above the maximum threshold of your cluster, so you need to increase the YARN memory settings.
The values of yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb both live in the config file yarn-site.xml, which in your case is managed by Cloudera Manager.
yarn.nodemanager.resource.memory-mb is the amount of physical memory, in MB, that can be allocated for containers.
yarn.scheduler.maximum-allocation-mb is the maximum memory, in MB, that can be allocated per YARN container, i.e. the maximum allocation for every container request at the ResourceManager. Memory requests higher than this won't take effect and will get capped to this value.
You can read more on the definitions and default values here: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
In the Cloudera Manager user interface, go to the YARN service > Configuration, search for these two properties, and increase their values.
Restart YARN for the changes to take effect.
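For reference, the underlying yarn-site.xml entries look roughly like this (2048 MB is an illustrative value that comfortably covers the 1024 + 384 MB request; Cloudera Manager normally writes this file for you):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>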
Dataproc is supposed to fit two executors per worker (i.e., per YARN NodeManager), with each one getting half the cores and half the memory.
And it does work that way.
However, if we override a setting, say spark.yarn.executor.memoryOverhead=4096,
then it only creates one executor per worker, and half of the cluster's cores and memory go unused. No matter how we play around with spark.executor.memory or spark.executor.cores, it still doesn't spin up enough executors to utilize all cluster resources.
How can I make Dataproc still create two executors per worker? The YARN overhead is deducted from the executor memory, so it should still be able to fit two executors, shouldn't it?
When executing in YARN, Spark will request containers with memory sized as spark.executor.memory + spark.yarn.executor.memoryOverhead. If you're adding to memoryOverhead, you will want to subtract an equal amount from spark.executor.memory to preserve the same container packing characteristics.
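A sketch with hypothetical numbers (the real defaults depend on the Dataproc machine type): if two executors normally fit per NodeManager, raising the overhead by 3 GB without shrinking the heap pushes the container past the half-node size, so only one fits; giving the same 3 GB back from the heap restores the packing:

# hypothetical sizes, for illustration only
# default packing:  spark.executor.memory=10g + 1g overhead = 11g container -> 2 per node
# after override:   10g heap + 4g overhead = 14g container -> only 1 fits
# keep 2 per node by shrinking the heap by the extra 3g:
spark.yarn.executor.memoryOverhead=4096
spark.executor.memory=7g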
When I start spark-shell using:
./bin/spark-shell --master spark://IP:7077 --executor-memory 4G
then no memory is allocated to Spark:
If, however, I just use the default:
./bin/spark-shell --master spark://IP:7077
then memory is allocated:
How can I use the maximum available memory in spark-shell? In this case:
845 MB + 845 MB + 2.8 GB = 4.49 GB
Update: It appears Spark will just allocate to each node the maximum available memory of the node with the least memory. So if I use:
./bin/spark-shell --master spark://IP:7077 --executor-memory 845M
then two nodes are fully allocated, but the node with 2.8 GB is not:
So the question now becomes: can Spark be configured so that each node uses its maximum free memory?
If your cluster consists of machines of different types, the memory limit applied to all executors will be taken from the machine with the least memory.
Spark is not smart enough to allocate more memory to one specific executor just because it runs on a bigger machine. That's why it's recommended to use homogeneous hardware to build a Spark/Hadoop cluster.
Secondly, try to avoid allocating 100% of a node's memory to executors, because it will make garbage collection slower, and other daemons need memory too. My suggestion would be to spare at least 5% of the memory on every single machine.
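For example (a sketch based on the numbers in your question), with the smallest workers offering 845 MB, something around 95% of that becomes the practical per-executor ceiling:

./bin/spark-shell --master spark://IP:7077 --executor-memory 800m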
For more insights I recommend this article http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Hope it helps.
In spark-env.sh, it's possible to configure the following environment variables:
# - SPARK_WORKER_MEMORY, to set how much memory to use (e.g. 1000m, 2g)
export SPARK_WORKER_MEMORY=22g
[...]
# - SPARK_MEM, to change the amount of memory used per node (this should
# be in the same format as the JVM's -Xmx option, e.g. 300m or 1g)
export SPARK_MEM=3g
If I start a standalone cluster with this:
$SPARK_HOME/bin/start-all.sh
I can see on the Spark master UI webpage that all the workers are using only 3 GB of RAM:
-- Workers Memory Column --
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
22.0 GB (3.0 GB Used)
[...]
However, I specified 22g as SPARK_WORKER_MEMORY in spark-env.sh
I'm somewhat confused by this. Probably I don't understand the difference between "node" and "worker".
Can someone explain the difference between the two memory settings and what I might have done wrong?
I'm using spark-0.7.0. See also here for more configuration info.
A standalone cluster can host multiple Spark applications (each one is tied to a particular SparkContext); i.e., you can have one application running k-means, one running Shark, and another one doing some interactive data mining.
In this case, the 22 GB is the total amount of memory you allocated to the Spark standalone cluster per worker, and your particular SparkContext instance is using 3 GB per node. So you could create six more SparkContexts at 3 GB each, using up to 21 GB per node.
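If the goal is for this one application to use more of each worker's 22 GB, the per-application setting is the one to raise; in Spark 0.7 that is SPARK_MEM (the equivalent of spark.executor.memory in later releases). A sketch, with 20g chosen only to leave a little headroom for the worker daemon:

# in spark-env.sh (or in the environment the application is launched from)
export SPARK_WORKER_MEMORY=22g   # total memory a worker can hand out to applications
export SPARK_MEM=20g             # memory this application's executors will use on each node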