How to limit pyspark resources - pyspark

I'm running pyspark on my local machine, and I want to limit the number of cores and the amount of memory it uses (I have 8 cores and 16 GB of memory).
I don't know how to do this. I've tried adding these lines to my code, but the process is still greedy.
from pyspark import SparkContext, SparkConf
conf = (SparkConf().setMaster("local[4]")
        .set("spark.executor.cores", "4")
        .set("spark.cores.max", "4")
        .set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
rdd = sc.parallelize(input_data, numSlices=4)
map_result = rdd.map(map_func)
map_result.reduce(reduce_func)
Why are the confs not applied?

This may be happening due to "precedence" in configurations, since Spark allows different ways to set configuration parameters. In the documentation we can see:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
For more info: Spark Documentation
So I suggest reviewing spark-submit parameters and configuration files.
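As a quick sanity check (a rough sketch, reusing the same settings as in the question), you can print the configuration the running context actually ended up with and compare it against what you expected:
from pyspark import SparkContext, SparkConf
conf = (SparkConf().setMaster("local[4]")
        .set("spark.executor.cores", "4")
        .set("spark.cores.max", "4")
        .set("spark.executor.memory", "6g"))
sc = SparkContext(conf=conf)
# Print every setting the context actually resolved, including anything
# picked up from spark-defaults.conf or spark-submit flags.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)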
Hope it helps.

Related

What is the difference between defining Spark Master in the CLI vs defining 'master' in the Spark application code?

What is the difference between "--master" defined on the spark-submit CLI and the master defined in the Spark application code?
In Spark we can specify the master URI either in the application code, like below:
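For example, a minimal pyspark sketch (the host and app name are placeholders):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .master("spark://host:7077")   # master URI set directly in the application code
         .appName("my-app")             # placeholder name
         .getOrCreate())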
Or we can specify the master URI as an argument to spark-submit, like below:
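For example (the script name is a placeholder):
spark-submit --master spark://host:7077 my_app.py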
Does one take precedence over the other? Do they have to agree, so that I end up with two instances of the same URI, one in the spark-submit invocation and one in the application code that creates the SparkSession? Will one override the other? What does the SparkSession do differently with the master argument, and what does the spark-submit master parameter do differently?
Any help would be greatly appreciated. Thank you!
To quote the official documentation:
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
So all of these are valid options, and there is a well-defined hierarchy which determines precedence if the same option is set in multiple places (sketched below). From highest to lowest:
Explicit settings in the application.
Command-line arguments.
Options from the configuration files.
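A small sketch of that hierarchy in practice (the app name is illustrative): if the application sets the master explicitly, that value generally wins even when spark-submit was launched with a different --master flag.
from pyspark.sql import SparkSession
# Explicit setting in the application: per the precedence rules above, this
# overrides a --master flag given to spark-submit and spark-defaults.conf.
spark = (SparkSession.builder
         .master("local[2]")
         .appName("precedence-check")   # illustrative name
         .getOrCreate())
print(spark.sparkContext.master)  # expected to print local[2]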
From the Spark documentation:
In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
It strikes me that the most flexible approach is passing flags to spark-submit.

Different outputs per number of partitions in Spark

I run Spark code on my local machine and on a cluster.
I create the SparkContext object for the local machine with the following code:
val sc = new SparkContext("local[*]", "Trial")
I create the SparkContext object for the cluster with the following code:
val spark = SparkSession.builder.appName(args(0)+" "+args(1)).getOrCreate()
val sc = spark.sparkContext
and I set the number of partitions to 4 for both the local machine and the cluster with the following code:
val dataset = sc.textFile("Dataset.txt", 4)
In my cluster, I created 5 nodes. One of them is the driver node; the rest run as workers.
I expect the results to be the same. However, the results of the two runs, local and cluster, are different. What are the reasons for this?
I create the SparkContext object for the local machine with the following code
and
I create the SparkContext object for the cluster with the following code:
It appears that you may have defined two different environments for sc and spark, since you define local[*] explicitly for sc while taking some default value for spark (which may come from external configuration files or from the master URL passed to spark-submit). These environments may differ, and that can affect what you use.
I expect the results to be the same. However, the results of the two runs, local and cluster, are different. What are the reasons for this?
The Dataset.txt you process in the local and cluster environments may be different, and hence the difference in the results. I'd strongly recommend using HDFS or some other shared file system to avoid such "surprises" in the future.
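A quick way to check whether the two environments really see the same input (sketched here with the pyspark API; the Scala equivalent is one-to-one, and the path is whatever you actually use):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print("master:", sc.master)          # which master/cluster manager this run is using
dataset = sc.textFile("Dataset.txt", 4)
print("records:", dataset.count())   # compare this count between the local and cluster runs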

Spark: difference or conflicts between setMaster in the app conf and the --master flag on spark-submit

I'm trying to understand the importance of setting the master property when running a spark application.
The cluster is at the default port of 7077. I'm running this app from a test machine, where it will hit an S3 bucket.
Currently spark configuration in the app reads:
val sparkConf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
but I'm also setting the flag on the command line with spark submit:
--master spark://127.0.0.1:7077
So, does having both of these set cause problems? Does one get overridden by the other? Are they both necessary?
So, does having both of these set cause problems? Does one get overridden by the other? Are they both necessary?
The Spark Configuration page is very clear (emphasis mine):
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.

How to restrict processing to a specified number of cores in Spark standalone

We have tried using various combinations of settings, but mpstat shows that all or most CPUs are always being used (on a single 8-core system).
The following have been tried:
setting the master to:
local[2]
passing
conf.set("spark.cores.max","2")
in the Spark configuration
Also using
--total-executor-cores 2
and
--executor-cores 2
In all cases
mpstat -A
shows that all of the CPUs are being used, and not just by the master.
So I am at a loss presently. We do need to limit the usage to a specified number of CPUs.
I had the same problem with memory size: I wanted to increase it, but none of the above worked for me either. Based on this user post I was able to resolve my problem, and I think this should also work for the number of cores:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf().set("spark.cores.max", "2"))
# Create new context
sc = SparkContext(conf=conf)
Hope this helps you. And please, if you have resolved your problem, post your solution as an answer so we can all benefit from it :)
Cheers
Apparently Spark standalone ignores the spark.cores.max setting. That setting does work in YARN.

Spark on YARN: how to send metrics to a Graphite sink?

I am new to Spark and we are running Spark on YARN. I can run my test applications just fine. I am trying to collect the Spark metrics in Graphite. I know what changes to make to the metrics.properties file, but how will my Spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metric settings from its config properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
can also be specified as a Spark property with the key spark.metrics.conf.*.sink.graphite.class and the value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes your /path/to/metrics.properties file end up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify a more complex directory structure there, or to have two files with the same basename.
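Put together, a full invocation then looks roughly like this (the master and application script are placeholders for whatever you actually submit):
spark-submit --master yarn --files /path/to/metrics.properties --conf spark.metrics.conf=metrics.properties my_app.py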
Relatedly, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.