How to set Hadoop configuration values from PySpark

The Scala version of SparkContext has the property
sc.hadoopConfiguration
I have successfully used that to set Hadoop properties (in Scala)
e.g.
sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")
However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values into the Hadoop Configuration used by the PySpark context?

sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')
should work

You can set any Hadoop property by prefixing it with spark.hadoop. and passing it via the --conf parameter when submitting the job.
--conf "spark.hadoop.fs.mapr.trace=debug"
Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
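Since anything prefixed with spark.hadoop. is copied (with the prefix stripped) into the Hadoop Configuration, a quick sketch of how you could verify it from a Scala shell started with the flag above:
// spark-shell --conf "spark.hadoop.fs.mapr.trace=debug"
// The spark.hadoop. prefix is stripped; the remainder lands in the Hadoop conf.
println(sc.hadoopConfiguration.get("fs.mapr.trace"))  // expected: debug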

I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:
fileLines = sc.newAPIHadoopFile(
    'dev/*',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()

Related

pyspark equivalent of the flag "isLocal" from Scala API

Scala has a flag isLocal. If this is true then we know that Spark is running in local mode, else it is running on a cluster. Is there any pyspark alternative to this? Or do we simply check sc.master?
It's not available in the Python API, but you can call isLocal on the Java SparkContext as:
sc._jsc.isLocal()

Best Practice for properties in ScalaSpark

I'm starting a project using Hadoop Spark. I'll be developing in Scala.
I'm creating the project from scratch and I was wondering what to do with properties.
I come from a Java background where I use a .properties file and load it at the start. Then I have a class used to access the different values of my properties.
Is this also a good practice in Scala ?
I tried googling, but couldn't find anything related to this.
You can read a properties file in Scala much as you would in Java:
import scala.io.Source.fromURL
val reader = fromURL(getClass.getResource("conf/fp.properties")).bufferedReader()
You can read more about I/O package at Scala Standard Library I/O
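If you want the same java.util.Properties access pattern you are used to from Java, a minimal sketch (assuming a conf/fp.properties file on the classpath; the AppConfig name is just illustrative) could look like this:
import java.util.Properties

// Illustrative helper: load a classpath properties file once and expose lookups.
object AppConfig {
  private val props = new Properties()
  private val stream = getClass.getResourceAsStream("/conf/fp.properties")
  require(stream != null, "conf/fp.properties not found on the classpath")
  try props.load(stream) finally stream.close()

  def get(key: String): String = props.getProperty(key)
}

// Usage: val host = AppConfig.get("cassandra.host")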
If you are looking to provide Spark properties, those are handled differently, e.g. by supplying them when you submit the Spark job.
Hope this helps.
Here is what we do:
1) scopt.OptionParser to parse command-line arguments.
2) key/value conf arguments are replicated to System.properties.
3) a config-file command-line argument is used to read a config file (using the Spark context to be able to read from S3/HDFS, with a custom code path to be able to read from jar resources).
4) the config file is parsed using com.typesafe.config.ConfigFactory.
5) default configs from resources and from the read file are combined using the withFallback mechanism; the order is important since we want Typesafe Config to use the values from (2) to override those from the files (see the sketch below).
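A minimal sketch of that merge, assuming the bundled defaults live in a defaults.conf resource and configFileContents holds the text of the external file read via the Spark context (both names are illustrative):
import com.typesafe.config.{Config, ConfigFactory}

// System properties (populated from the key/value command-line args) override
// the external config file, which in turn overrides the defaults from the jar.
def mergedConfig(configFileContents: String): Config =
  ConfigFactory.systemProperties()
    .withFallback(ConfigFactory.parseString(configFileContents))
    .withFallback(ConfigFactory.parseResources("defaults.conf"))
    .resolve()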
There are three ways to provide properties to Spark:
1) Spark properties in SparkConf (original spec):
Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf passed to your SparkContext.
2) Dynamically loading Spark properties (original spec), which avoids hard-coding certain configurations in a SparkConf:
./bin/spark-submit --name "My app" --master local[*] --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
3) Overriding spark-defaults.conf, the default Spark properties file (original spec).
I listed the options by priority: SparkConf has the highest priority and spark-defaults.conf has the lowest. For more details check this post.
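As a small sketch of the first (highest-priority) option: a value set programmatically on SparkConf wins over the same key passed via --conf or spark-defaults.conf.
import org.apache.spark.{SparkConf, SparkContext}

// Anything set here takes precedence over --conf flags and spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("My app")
  .set("spark.eventLog.enabled", "false")
val sc = new SparkContext(conf)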
If you want to store all properties in a single place, just use Typesafe Config. Typesafe Config avoids reading files through raw input streams and is a widely used approach in Scala applications.

How to set spark.local.dir property from spark shell?

I'm trying to set spark.local.dir from spark-shell using sc.getConf.set("spark.local.dir","/temp/spark"), but it is not working. Is there any other way to set this property from spark-shell?
You can't do it from inside the shell: since the Spark context was already created, the local dir was already set (and used). You should pass it as a parameter when starting the shell:
./spark-shell --conf spark.local.dir=/temp/spark
@Tzach Zohar's solution seems to be the right answer.
However, if you insist on setting spark.local.dir from spark-shell, you can do it:
1) close the current spark context
sc.stop()
2) update the SparkContext configuration, and restart it.
The updated code was kindly provided by @Tzach-Zohar:
SparkSession.builder.config(sc.getConf).config("spark.local.dir","/temp/spark").getOrCreate()
@Tzach Zohar notes "but you get a WARN SparkContext: Use an existing SparkContext, some configuration may not take effect", which suggests this isn't the recommended path to take.
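Put together, a sketch of the whole restart sequence inside spark-shell (Spark 2.x) would be roughly:
// Inside an existing spark-shell session:
sc.stop()                                     // 1) stop the current context
val spark = org.apache.spark.sql.SparkSession.builder
  .config(sc.getConf)                         // reuse the old settings
  .config("spark.local.dir", "/temp/spark")   // 2) add/override the property
  .getOrCreate()
val newSc = spark.sparkContext                // use this context from now on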

How to stop Spark from loading defaults?

When I do a spark-submit, the defaults conf set up in the SPARK_HOME directory is found and loaded into the System properties.
I want to stop the defaults conf from being loaded, and just get the command line arguments, so that I may re-order how spark is configured before creating my spark context.
Is this possible?
There are a couple ways to modify configurations.
According to the spark docs, you can modify configs at runtime with flags (http://spark.apache.org/docs/latest/configuration.html):
The Spark shell and spark-submit tool support two ways to load
configurations dynamically. The first are command line options, such
as --master, as shown above. spark-submit can accept any Spark
property using the --conf flag... Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf.
which means you can kick off your jobs like this:
./bin/spark-submit --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
Or, you can edit spark-defaults.conf and avoid passing additional flags in your spark-submit command.
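For example, the equivalent entries in $SPARK_HOME/conf/spark-defaults.conf (whitespace-separated key/value pairs) would be roughly:
spark.eventLog.enabled           false
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps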
Here's a solution I found acceptable for my issue:
Create a blank "blank.conf" file, and supply it to Spark using --properties-file:
${SPARK_HOME}/bin/spark-submit --master local --properties-file "blank.conf" # etc
Spark will use that file instead of finding and loading spark-defaults.conf. You can then manually load the defaults conf later, before creating your SparkContext, if that's your desire.
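A rough sketch of that manual re-load, assuming the standard whitespace-separated key/value format of spark-defaults.conf, applying the values however you prefer before building the context:
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

// Re-read spark-defaults.conf ourselves and apply it on our own terms.
val defaultsPath = sys.env("SPARK_HOME") + "/conf/spark-defaults.conf"
val conf = new SparkConf()
for (line <- Source.fromFile(defaultsPath).getLines().map(_.trim)
     if line.nonEmpty && !line.startsWith("#")) {
  line.split("\\s+", 2) match {
    case Array(key, value) => conf.setIfMissing(key, value) // or conf.set to force-override
    case _                 => // ignore malformed lines
  }
}
val sc = new SparkContext(conf)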

How can we deploy an existing kafka-spark-cassandra project to kafka-dataproc-cassandra on Google Cloud Platform?

My existing project is kafka-spark-cassandra. Now I have a GCP account and have to migrate the Spark jobs to Dataproc. In my existing Spark jobs, parameters like master IP, memory, cores etc. are passed through the command line by a Linux shell script that triggers the job, and are used to create a new SparkConf.
val conf = new SparkConf(true)
  .setMaster(master)
  .setAppName("xxxx")
  .setJars(List(path + "/xxxx.jar"))
  .set("spark.executor.memory", memory)
  .set("spark.cores.max", cores)
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.cassandra.connection.host", cassandra_ip)
1) How can this be configured in Dataproc?
2) Will there be any compatibility issues between Spark 1.3 (existing project) and the Spark 1.6 provided by Dataproc? How can they be resolved?
3) Is there any other connector needed for Dataproc to get connected with Kafka and Cassandra? I couldn't find any.
1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying master directly in code, instead letting it come from the spark.master value inside of spark-defaults.conf, and then your local setup would have that config set to local while Dataproc would automatically have it set to yarn-client with the necessary yarn config settings alongside it.
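Concretely, the SparkConf from the question might shrink to a sketch like this, with the cluster-specific values left to spark-defaults.conf and the submit command (the cassandra_ip value is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val cassandra_ip = "10.0.0.5"  // placeholder
// Only app-level settings stay in code; master, jars, memory and cores
// come from spark-defaults.conf or the submission command.
val conf = new SparkConf(true)
  .setAppName("xxxx")
  .set("spark.cassandra.connection.host", cassandra_ip)
val sc = new SparkContext(conf)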
Likewise, keys like spark.executor.memory, etc., should make use of Spark's first-class command-line flags if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flag instead of specifying the jars in code.
2) When building your project to deploy, ensure you exclude spark (e.g., in maven, mark spark as provided). You may hit compatibility issues, but without knowing all APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general Spark core is considered GA and should thus be mostly backwards compatible in 1.X versions, but the compatibility guidelines didn't apply yet to subprojects like mllib and SparkSQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified as --jars, or installed onto the cluster at creation via initialization actions.
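For instance, a sketch of a Dataproc job submission that ships the connectors as extra jars (the cluster name, bucket paths, class name and jar file names are placeholders):
gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --class com.example.MyStreamingJob \
  --jars gs://my-bucket/my-app.jar,gs://my-bucket/spark-cassandra-connector.jar,gs://my-bucket/spark-streaming-kafka.jar \
  --properties spark.cassandra.connection.host=10.0.0.5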