Support multiple Spark distributions on Yarn cluster - scala

I run multiple Spark jobs on a cluster via $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster.
When a new version of Spark goes live, I'd like to roll out the new distribution over the cluster alongside the old one and then gradually migrate my jobs one by one.
Unfortunately, Spark relies on the global $SPARK_HOME variable, so I can't figure out how to achieve this.
It would be especially useful when Spark for Scala 2.12 is out.

It is possible to run any number of Spark distributions on a YARN cluster. I've done it many times on my MapR cluster, mixing 1-3 different versions, as well as setting up official Apache Spark there.
All you need to do is tweak conf/spark-env.sh in each distribution (create it by renaming spark-env.sh.template) and add a line:
export SPARK_HOME=/your/location/of/spark/spark-2.1.0
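For a gradual, job-by-job migration, one option is a small wrapper that picks the distribution by version and sets SPARK_HOME only for that single invocation. This is a minimal sketch, not part of Spark: the helper names spark_home_for and submit_with are hypothetical, and the base path simply reuses the placeholder location from the line above.

```shell
#!/bin/sh
# Hypothetical helper: map a Spark version to its install directory.
# The base path is the placeholder from the answer; adjust it to your layout.
spark_home_for() {
  echo "/your/location/of/spark/spark-$1"
}

# Hypothetical wrapper: run spark-submit from the chosen distribution.
# SPARK_HOME is set only for this one command, so jobs that still use
# the old distribution are not affected.
submit_with() {
  version="$1"
  shift
  home="$(spark_home_for "$version")"
  SPARK_HOME="$home" "$home/bin/spark-submit" --master yarn --deploy-mode cluster "$@"
}
```

A job would then be moved to the new version by changing only the first argument, for example submit_with 2.1.0 --class com.example.MyJob my-job.jar (the class and jar names are placeholders).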

Related

Why does Flink use Yarn?

I am taking a deep look inside Flink to see how I can use it on a project and had a question for the creators / high level thinkers... why does Flink use Yarn as the default resource manager?
Was Kubernetes considered? Or is it one of those things where we started on Yarn, it works pretty well...
I have come across many projects and articles that allow Kubernetes and Yarn to work together, including the Myriad project that allows Yarn to be deployed with Mesos (but I am on Kubernetes...)
I have a very large compute cluster (2000 or so nodes) that I use, and I want to use the super cool CEP features of Flink feeding off a Kafka infrastructure (also deployed onto this Kubernetes environment).
I am looking to understand the reasons behind using Yarn as the resource manager underneath Flink, and whether it would be possible (with some effort and contribution to the project) to make Kubernetes an option alongside Yarn.
Please note - I am new to Yarn - just reading up about it. Also new to Flink and learning about the deployment and scale-out architecture.
Flink is not tied to YARN. It can also run on Apache Mesos and there are also users running it on Kubernetes. In the current version (Flink 1.4.1), there are a few things to consider when running Flink in Kubernetes (see this talk by Patrick Lucas).
The Flink community is also currently working on improving Flink's support for container setups. The effort is called FLIP-6 and will be included in the next release (Flink 1.5.0).

scala spark cassandra installation

How many ways are there to run Spark? If I just declare the dependencies in build.sbt, is Spark supposed to be downloaded and just work?
But if I want to run Spark locally (download the Spark tar file, winutils...), how can I specify in Scala code that I want to run my code against the local Spark and not against the dependencies downloaded by IntelliJ?
In order to connect Spark to Cassandra, do I need a local installation of Spark? I read somewhere that it's not possible to connect from a "programmatic" Spark (one pulled in only as a dependency) to a local Cassandra database.
1) Spark runs in a slightly unusual way: there is your application (the Spark Driver and Executors) and there is the Resource Manager (Spark Master/Workers, Yarn, Mesos or Local).
In your code you can run against the in-process manager (local) by specifying the master as local or local[n]. Local mode requires no installation of Spark, as it is set up automatically in the process you are running. This uses the dependencies you downloaded.
To run against a Spark Master which is running locally, you use a spark:// URL that points at your particular local Spark Master instance. Note that this will cause executor JVMs to start separately from your application, which means the application code and dependencies have to be distributed to them. (Other resource managers have their own identifying URLs.)
2) You do not need a "Resource Manager" to connect to C* from Spark, but this ability is basically for debugging and testing. To do this you would use the local master URL. Normal Spark usage should have an external Resource Manager, because without one the system cannot be distributed.
For some more Spark Cassandra examples see
https://github.com/datastax/SparkBuildExamples
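The two master URL forms described above can be sketched as a small helper. This is an illustration only: master_url is a hypothetical function name, the default of 2 threads is arbitrary, and the standalone case assumes a Spark Master listening on its default port 7077 (the host has to match the one the Master actually advertises).

```shell
#!/bin/sh
# Hypothetical helper: build the --master value for the two modes above.
#   master_url local [threads]    -> in-process manager, no Spark install needed
#   master_url standalone [host]  -> a separately running Spark Master
master_url() {
  case "$1" in
    local)      echo "local[${2:-2}]" ;;
    standalone) echo "spark://${2:-localhost}:7077" ;;
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

The result can be passed to spark-submit as --master "$(master_url local 4)", or used as the master string when building the Spark configuration in code.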

Installing a spark cluster on a hadoop cluster

I am trying to install an apache spark cluster on a hadoop cluster.
I am looking for best practices in this regard. I am assuming that the Spark master needs to be installed on the same machine as the Hadoop namenode and the Spark slaves on the Hadoop datanodes. Also, where do I need to install Scala? Please advise.
If your Hadoop cluster is running YARN, just use yarn mode for submitting your applications. That's going to be the easiest method, and it requires nothing beyond downloading the Apache Spark distribution to a client machine. One additional thing you can do is deploy the Spark assembly to HDFS and point the spark.yarn.jar config at it when you call spark-submit, so that the JAR is cached on the nodes.
See here for all the details: http://spark.apache.org/docs/latest/running-on-yarn.html
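As a rough sketch of the spark.yarn.jar setup, the function below prints (rather than runs) the resulting spark-submit command so it can be reviewed first. The HDFS path and the function name yarn_submit_cmd are hypothetical. Note that spark.yarn.jar is the Spark 1.x property; Spark 2.x replaced it with spark.yarn.jars and spark.yarn.archive.

```shell
#!/bin/sh
# Hypothetical HDFS location of the uploaded Spark assembly JAR.
ASSEMBLY_HDFS="hdfs:///user/spark/share/lib/spark-assembly.jar"

# Print the spark-submit command line for a YARN submission that reuses
# the assembly cached in HDFS. Extra arguments are appended unchanged.
yarn_submit_cmd() {
  printf 'spark-submit --master yarn --conf spark.yarn.jar=%s %s\n' "$ASSEMBLY_HDFS" "$*"
}
```

The assembly would first have to be copied into HDFS (for instance with hdfs dfs -put) at whatever location ASSEMBLY_HDFS names.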

Should the worker also need Hadoop installed for Spark?

I have set up Scala, Hadoop and Spark on the master node and started it successfully.
On the worker (slave) I installed only Scala and Spark, and started it too. What I am confused about is whether Hadoop should also be set up on the worker for running tasks.
This link from the official Apache Spark documentation shows how to configure a Spark cluster. The requirements are clearly explained there: both Scala and Hadoop are required.

How to create a Spark Streaming jar that would work in AWS EMR?

I've been developing a Spark Streaming application with Eclipse, and I'm using sbt to run it locally.
Now I want to deploy the application on AWS using a jar, but when I use sbt's package command it creates a jar without all the dependencies, so when I upload it to AWS it doesn't work because Scala is missing.
Is there a way to create an uber-jar with sbt? Am I doing something wrong with the deployment of Spark on AWS?
To create an uber-jar with sbt, use the sbt plugin sbt-assembly. For more details about creating an uber-jar using sbt-assembly, refer to the blog post.
After creating it, you can run the assembly jar using the java -jar command.
But from Spark 1.0.0 onwards, the spark-submit script in Spark's bin directory is used to launch applications on a cluster. For more details, refer here.
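A minimal sketch of that workflow from the shell, assuming sbt-assembly's default output location (target/scala-<version>/<name>-assembly-<version>.jar). The helper name find_assembly_jar is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: locate the jar produced by `sbt assembly`.
# Takes the project directory (default: current directory) and prints
# the first matching assembly jar, or nothing if none has been built.
find_assembly_jar() {
  ls "${1:-.}"/target/scala-*/*-assembly-*.jar 2>/dev/null | head -n 1
}
```

After running sbt assembly in the project directory, the jar could be submitted with something like spark-submit --class com.example.StreamingApp "$(find_assembly_jar)", where the class name is a placeholder. When submitting this way, the Spark dependencies themselves are commonly marked as "provided" in build.sbt so they are not bundled into the uber-jar.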
You should really be following Running Spark on EC2, which reads:
The spark-ec2 script, located in Spark’s ec2 directory, allows you to
launch, manage and shut down Spark clusters on Amazon EC2. It
automatically sets up Spark, Shark and HDFS on the cluster for you.
This guide describes how to use spark-ec2 to launch clusters, how to
run jobs on them, and how to shut them down. It assumes you’ve already
signed up for an EC2 account on the Amazon Web Services site.
I've only partially followed the document so I can't comment on how well it's written.
Moreover, according to the Shipping Code to the Cluster chapter in the other document:
The recommended way to ship your code to the cluster is to pass it
through SparkContext’s constructor, which takes a list of JAR files
(Java/Scala) or .egg and .zip libraries (Python) to disseminate to
worker nodes. You can also dynamically add new files to be sent to
executors with SparkContext.addJar and addFile.