How can I deploy my existing Kafka - Spark - Cassandra project to Kafka - Dataproc - Cassandra on Google Cloud Platform? - apache-kafka

My existing project is Kafka-Spark-Cassandra. I now have a GCP account and have to migrate the Spark jobs to Dataproc. In my existing Spark jobs, parameters such as master IP, memory, cores, etc. are passed on the command line by a Linux shell script and used to create a new SparkConf:
val conf = new SparkConf(true)
.setMaster(master)
.setAppName("xxxx")
.setJars(List(path+"/xxxx.jar"))
.set("spark.executor.memory", memory)
.set("spark.cores.max",cores)
.set("spark.scheduler.mode", "FAIR")
.set("spark.cassandra.connection.host", cassandra_ip)
1) How can this be configured in Dataproc?
2) Will there be any compatibility issues between Spark 1.3 (existing project) and the Spark 1.6 provided by Dataproc? How can they be resolved?
3) Is any other connector needed for Dataproc to connect with Kafka and Cassandra? I couldn't find any.

1) When submitting a job, you can specify arguments and properties: https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/spark. When determining which properties to set, keep in mind that Dataproc submits Spark jobs in yarn-client mode.
In general, this means you should avoid specifying the master directly in code and instead let it come from the spark.master value inside spark-defaults.conf; your local setup would have that config set to local, while Dataproc automatically sets it to yarn-client with the necessary YARN config settings alongside it.
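For example, a minimal sketch of the SparkConf from the question with the master and resource settings stripped out, so they can come from spark-defaults.conf or the submit command instead (the Cassandra host is assumed to still arrive as a program argument):
import org.apache.spark.{SparkConf, SparkContext}

// Reworked setup: master, executor memory, and max cores are no longer hard-coded;
// they come from spark-defaults.conf or --properties at submit time.
val cassandraIp = args(0)  // assumed to still be passed in by the launcher script
val conf = new SparkConf(true)
  .setAppName("xxxx")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.cassandra.connection.host", cassandraIp)
val sc = new SparkContext(conf)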
Likewise, keys like spark.executor.memory, etc., should be set through Spark's first-class command-line options if running spark-submit directly:
spark-submit --conf spark.executor.memory=42G --conf spark.scheduler.mode=FAIR
or if submitting to Dataproc with gcloud:
gcloud dataproc jobs submit spark \
--properties spark.executor.memory=42G,spark.scheduler.mode=FAIR
You'll also want to look at the equivalent --jars flags for jars instead of specifying them in code.
2) When building your project to deploy, ensure you exclude Spark (e.g., in Maven, mark Spark as provided). You may hit compatibility issues, but without knowing all the APIs in use, I can't say one way or the other. The simplest way to find out is to bump Spark to 1.6.1 in your build config and see what happens.
In general, Spark core is considered GA and should thus be mostly backwards compatible within 1.X versions, but the compatibility guidelines didn't yet apply to subprojects like MLlib and Spark SQL, so if you use those you're more likely to need to recompile against the newer Spark version.
3) Connectors should either be included in a fat jar, specified via --jars, or installed onto the cluster at creation time via initialization actions; a build sketch follows below.
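As one hedged illustration of both points, here is a minimal build.sbt sketch assuming an sbt build (the Maven equivalent of "provided" is <scope>provided</scope>); the connector coordinates and versions are assumptions to adapt to your project:
// Spark itself is supplied by the Dataproc cluster, so mark it "provided";
// the Kafka and Cassandra connectors get shipped in the fat jar (or via --jars).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)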

Related

A master URL must be set in your configuration (Spark Scala on AWS)

This is what I wrote via IntelliJ. I plan on eventually writing larger Spark Scala files.
Anyway, I uploaded it to an AWS cluster that I had made. The "master" line, line 11, was master("local"). I ran into this error.
The second picture is the error that was returned by AWS when it did not run successfully. I changed line 11 to "yarn" instead of local (see the first picture for its current state).
It still returns the same error. I put in the following flags when I uploaded it manually:
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost the exact same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. It looks like I need a little more knowledge of how Spark works.
I am working with Amazon EMR.
I think that on line 9 you are creating a SparkContext with the "old way" approach used in Spark 1.6.x and older versions: you need to set the master in the default configuration file (usually conf/spark-defaults.conf) or pass it to spark-submit (it is required when using new SparkConf())...
On line 10 you are creating the "spark" context with SparkSession, which is the Spark 2.0.0 approach. So in my opinion your problem is line 9: I think you should remove it and work with SparkSession, or set the required configuration for the SparkContext in case you need sc.
You can access the SparkContext with sparkSession.sparkContext.
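For example, a minimal Scala sketch (the app name is only a placeholder):
import org.apache.spark.sql.SparkSession

// Build (or reuse) the SparkSession, then pull the underlying SparkContext from it.
val sparkSession = SparkSession.builder().appName("spark-application-name").getOrCreate()
val sc = sparkSession.sparkContext  // a field in Scala, so no parentheses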
If you still want to use SparkConf, you need to define the master programmatically:
val sparkConf = new SparkConf()
.setAppName("spark-application-name")
.setMaster("local[4]")
.set("spark.executor.memory","512m");
or with a declarative approach in conf/spark-defaults.conf:
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the code below:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
You need to provide the proper URL for your AWS cluster.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and be able to specify the intended master right when I POST /jobs, rather than permanently in the config file (via the master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass the master as part of the job config. A rough sketch follows.
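A minimal, hedged sketch of such a provider, assuming the trait exposes a single getSparkMaster(config): String method as in the linked source; the config key name used here is made up for illustration:
import com.typesafe.config.Config
import spark.jobserver.util.SparkMasterProvider

// Hypothetical provider: pull the master URL out of the submitted config instead of
// using the static spark.master setting from the job server's own config file.
class ConfigDrivenSparkMasterProvider extends SparkMasterProvider {
  override def getSparkMaster(config: Config): String =
    config.getString("custom.spark.master")  // key name is an assumption
}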

Best practice for keeping local vs. test vs. production configuration properties in Spark/Scala

We have some input directories which we use to load files and process in Spark SQL.
The directories are, of course, different on local machines vs. test vs. production.
What is the best way to parametrize these, so that we can build, run tests, and deploy with sbt without having to
a) change many of the configuration settings by hand,
b) have developers use their own configuration settings,
c) have the build target different deployments?
You can choose from many options:
Pass as arguments in spark-submit
  Very simple, but won't scale if the number of settings increases
  I'd only use it to pass a single argument which defines the environment (dev, test, prod, ...)
Use property files
  Use an argument passed to spark-submit to specify the file to be read from HDFS (example: hdfs://localhost:9000/conf/dev.properties)
Store in a JSON file, and read it in as a DataFrame
  If you want to query the configuration using SQL
Store in an RDBMS, and read it in as a DataFrame
  If you have access to a running RDBMS (or you can install one)
  Possibly there already is an RDBMS, if you have a Hive metastore backed by one
  Offers batch updates/deletes using SQL
  Might require some effort if you want high availability
Use a distributed configuration service
  If you have access to a running ZooKeeper et al.
  In the case of ZooKeeper: you can update values, and you can register callbacks if a value changes
Use a key/value store
  If you have access to Infinispan, Redis, Memcached et al.
  For example, Infinispan provides a distributed, replicated, persistent java.util.Map
There are certainly other options (LDAP, for example), but I'd opt for properties: immutable configuration values are normally sufficient, it likely does not introduce new dependencies, and it's easy to manage from the command line and/or an sbt task.
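A minimal sketch of the property-file option, assuming the HDFS path of the environment's file is passed as the first program argument and that a key named input.dir holds the input directory (both of these names are illustrative, not prescribed):
import java.util.Properties
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object ConfigLoadingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("config-loading-example"))
    // args(0) is expected to look like hdfs://localhost:9000/conf/dev.properties
    val props = new Properties()
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val in = fs.open(new Path(args(0)))
    try props.load(in) finally in.close()
    val inputDir = props.getProperty("input.dir")  // illustrative key name
    println(s"Reading input from $inputDir")
    sc.stop()
  }
}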
Well, if you don't want your own config or settings package that your Spark application simply reads from, you could have the application take the file path as a parameter when submitting the Spark app.
For example:
./bin/spark-submit \
--class org.YourApp \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/yourApp.jar \
/path/to/directory
And if you use Oozie then you can change the path parameter in the XML as needed.

Is there a way to submit a Spark application to a cluster without creating a jar?

I am building a service which takes in a string of Spark code to execute on a cluster. Is there any way for me to set the Spark context to the cluster and execute without building a jar and submitting it?
Indeed, you can use the Spark shell, or look at something like the IBM Spark Kernel, Zeppelin, etc., to have a long-running SparkContext you can submit code to and have it run. As you are almost certainly already aware, be very careful about accepting strings and executing them on the cluster (e.g. accept them only from a trusted source).

spark on yarn; how to send metrics to graphite sink?

I am new to Spark and we are running Spark on YARN. I can run my test applications just fine. I am trying to collect the Spark metrics in Graphite. I know what changes to make to the metrics.properties file, but how will my Spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metrics settings from its config properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
.set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
.set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
// etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes your /path/to/metrics.properties file end up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify a more complex directory structure there, or to have two files with the same basename.
Relatedly, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.