spark on yarn; how to send metrics to graphite sink? - scala

I am new to spark and we are running spark on yarn. I can run my test applications just fine. I am trying to collect the spark metrics in Graphite. I know what changes to make to metrics.properties file. But how will my spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

I have found a different solution to the same problem. It looks like that Spark can also take these metric settings from its config properties. For example the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
.set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
.set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
// etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.

I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes it so your /path/to/metrics.properties file ends up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify more complex directory structure there, or have two files with the same basename.
Related, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.

Related

Best Practice for properties in ScalaSpark

I'm starting a project using Hadoop Spark. I'll be developing in Scala.
I'm creating the project from scratch and I was wondering what to do with properties.
I come from a Java Background where I use .properties file and load them at the start. Then I have a class used to access the different value of my properties.
Is this also a good practice in Scala ?
Tried googling, but there isn't anything relating to this.
You can read the properties file in scala similar to Java
import scala.io.Source.fromUrl
val reader = fromURL(getClass.getResource("conf/fp.properties")).bufferedReader()
You can read more about I/O package at Scala Standard Library I/O
If you are looking to provide spark properties then that have different way of doing it e.g. providing them at time when you submit spark job.
Hoping this helps.
Here we do:
scopt.OptionParser to parse command line arguments.
key/value arguments conf are replicated to System.properties
command line arg config-file is used to read config file (using spark context to be able to read from S3/HDFS with custom code path to be able to read from jar resources)
config file parsed using com.typesafe.config.ConfigFactory.
Default configs from resources and from read file are combined using the withFallback mechanism. The order is important since we want typesafe to use values from (2) to override thoses from the files.
There are three ways to determine properties for Spark:
Spark Propertis in SparkConf original spec:
Spark properties control most application settings and are configured
separately for each application. These properties can be set directly
on a SparkConf passed to your SparkContext.
Dynamically Loading Spark Properties original spec, it avoids hard-coding certain configurations in a SparkConf:
./bin/spark-submit --name "My app" --master local[*] --conf spark.eventLog.enabled=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Overriding spark-defaults.conf — Default Spark Properties File - original spec
I described properties by priority - SparkConf has the highest priority and spark-conf has the lowest priority. For more details check this post
If you want to store all properties in the single place, just you Typesafe Config. Typesafe Config gets rid of using input streams for reading file, it's widely used the approach in scala app.

configuring scheduling pool in spark using zeppelin, scala and EMR

In pyspark I'm able to change to a fair scheduler within zeppelin (on AWS EMR) by doing the following:
conf = sc.getConf()
conf.set('spark.scheduler.allocation.file',
'/etc/spark/conf.dist/fairscheduler.xml.template')
sc.setLocalProperty("spark.scheduler.pool", 'production')
However if I try something similar in a scala cell it then things continue to run in the FIFO pool
val conf = sc.getConf()
conf.set("spark.scheduler.allocation.file",
"/etc/spark/conf.dist/fairscheduler.xml.template")
sc.setLocalProperty("spark.scheduler.pool", "FAIR")
I've tried so many combinations, but nothing has worked. Any advice is appreciated.
I ran into a similar issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config. It might be that your Scala Spark interpreter is set up with spark.scheduler.pool but your python isn't.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.

A master url must be set to your configuration (Spark scala on AWS)

This is what I wrote via intellij. I plan on eventually writing larger spark scala files.
Anyways, I uploaded it on an AWS cluster that I had made. The "master" line, line 11 was "master("local")". I ran into this error
The second picture is the error that was returned by AWS when it did not run successfully. i changed line 11 to "yarn" instead of local (see the first picture for its current state)
It still is returning the same error. I put in the following flags when I uploaded it manually
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost the exact same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. Looks like I need a little more knowledge on how spark works.
I am working with amazon EMR.
I think on the line 9 you are creating SparkContext with "old way" approach in spark 1.6.x and older version - you need to set master in default configuration file (usually location conf/spark-defaults.conf) or pass it to spark-submit (it is required in new SparkConf())...
On line 10 you are creating "spark" context with SparkSesion which is approach in spark 2.0.0. So in my opinion your problem is line num. 9 and I think you should remove it and work with SparkSesion or set reqiered configuration for SparkContext In case when you need sc.
You can access to sparkContext with sparkSession.sparkContext();
If you still want to use SparkConf you need to define master programatically:
val sparkConf = new SparkConf()
.setAppName("spark-application-name")
.setMaster("local[4]")
.set("spark.executor.memory","512m");
or with declarative approach in conf/spark-defaults.conf
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the below code:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
you need to provide the proper link to your aws cluster.

Spark difference or conflicts between setMaster in app conf and --master flag on sparkSubmit

I'm trying to understand the importance of setting the master property when running a spark application.
The cluster location is at the default port of 7077. I'm running this app from a testmachine where it will hit an s3 bucket.
Currently spark configuration in the app reads:
val sparkConf = new SparkConf()
.setMaster("spark://127.0.0.1:7077")
but I'm also setting the flag on the command line with spark submit:
--master spark://127.0.0.1:7077
So, does having both of these set cause problems? Does one get overridden by the other? Are they both necessary?
So, does having both of these set cause problems? Does one get
overridden by the other? Are they both necessary?
The Spark Configuration page is very clear (emphasis mine):
Any values specified as flags or in the properties file will be passed
on to the application and merged with those specified through
SparkConf. Properties set directly on the SparkConf take highest
precedence, then flags passed to spark-submit or spark-shell, then
options in the spark-defaults.conf file. A few configuration keys have
been renamed since earlier versions of Spark; in such cases, the older
key names are still accepted, but take lower precedence than any
instance of the newer key.

How to restrict processing to specified number of cores in spark standalone

We have tried using various combinations of settings - but mpstat is showing that all or most cpu's are always being used (on a single 8 core system)
Following have been tried:
set master to:
local[2]
send in
conf.set("spark.cores.max","2")
in the spark configuration
Also using
--total-executor-cores 2
and
--executor-cores 2
In all cases
mpstat -A
shows that all of the CPU's are being used - and not just by the master.
So I am at a loss presently. We do need to limit the usage to a specified number of cpu's.
I had the same problem with memory size and I wanted to increase it but none of the above worked for me as well. Based on this user post I was able to resolve my problem and I think this should also work for number of cores:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf().set("spark.cores.max", "2"))
# Create new context
sc = SparkContext(conf=conf)
Hope this helps you. And please, if you have resolved your problem, send your solution as answer for this post so we can all benefit from it :)
Cheers
Apparently spark standalone ignores the spark.cores.max setting. That setting does work in yarn.