How to run a Flink Job on a remote YARN cluster - scala

I am having some issues deploying a Flink job remotely through the Scala API.
I have no problem launching a YARN session on my cluster and then running my job from the command line with a JAR.
What I want is to run my job directly from my IDE. How do I do that in Scala?
val env = ExecutionEnvironment.createRemoteEnvironment("mymaster", 6123, "myjar-with-dependencies.jar")
This does not work, and I realize that I am not declaring any YARN deployment with it.
Any help?

Flink does not currently (March 2017, Flink 1.2) allow deploying on YARN programmatically through an ExecutionEnvironment.
You could look into Flink's internal, undocumented APIs for deploying it on YARN, and then submit your job through the remote environment.
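A minimal sketch of the second half of that workflow, assuming a YARN session has already been started out of band (e.g. with bin/yarn-session.sh) and that its JobManager is reachable at the hypothetical host, port, and JAR path used below:

import org.apache.flink.api.scala._

object RemoteWordCount {
  def main(args: Array[String]): Unit = {
    // Connect to the JobManager of an already-running YARN session.
    // Host, port, and JAR path are placeholders for your own values.
    val env = ExecutionEnvironment.createRemoteEnvironment(
      "session-jobmanager-host",             // host reported by yarn-session.sh
      6123,                                  // default JobManager RPC port
      "target/myjar-with-dependencies.jar")  // fat JAR containing this job's classes

    env.fromElements("to be", "or not to be")
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(0)
      .sum(1)
      .print()                               // triggers execution on the remote cluster
  }
}

The key limitation remains: the remote environment can only talk to a JobManager that is already up; it cannot allocate the YARN session for you.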

Related

Unable to use the google.cloud.sql.connector module in Google composer

I am trying to schedule a Dataflow pipeline job to read content from a Cloud SQL SQL Server instance and write it to a BigQuery table. I'm using google.cloud.sql.connector[pytds] to set up the connection. The manual Dataflow job runs successfully when I run it through the Google Cloud shell. The Airflow version (using Google Cloud Composer) fails with a NameError:
'NameError: name 'Connector' is not defined'
I have enabled the save-main-session option. I have also listed the connector module in the py_requirements option, and it is being installed (as per the Airflow logs).
py_requirements=['apache-beam[gcp]==2.41.0','cloud-sql-python-connector[pytds]==0.6.1','pyodbc==4.0.34','SQLAlchemy==1.4.41','pymssql==2.2.5','sqlalchemy-pytds==0.3.4','pylint==2.15.4']
[2022-11-02 07:40:53,308] {process_utils.py:173} INFO - Collecting cloud-sql-python-connector[pytds]==0.6.1
[2022-11-02 07:40:53,333] {process_utils.py:173} INFO - Using cached cloud_sql_python_connector-0.6.1-py2.py3-none-any.whl (28 kB)
But it seems the import is not working.
You have to install the PyPI packages on the Cloud Composer nodes; there is a tab for this in the GUI, on the Composer page:
Add all the needed packages for your Dataflow job in Composer via this page, except Apache Beam and Apache Beam GCP because Beam and Google Cloud dependencies are already installed in Cloud Composer.
Cloud Composer is the runner of your Dataflow job, and the runner is what instantiates the job. To instantiate the job correctly, the runner needs to have the dependencies installed.
Then, at execution time, the Dataflow job will use the given py_requirements or setup.py file on the workers.
py_requirements or setup.py must also contain the packages needed to execute the Dataflow job.

Support multiple Spark distributions on Yarn cluster

I run multiple spark jobs on a cluster via $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster.
When a new version of Spark goes live, I'd like to somehow roll out the new distribution over the cluster alongside the old one and then gradually migrate all my jobs one by one.
Unfortunately, Spark relies on the $SPARK_HOME environment variable, so I can't figure out how to achieve this.
It would be especially useful when Spark for Scala 2.12 is out.
It is possible to run any number of Spark distributions on a YARN cluster. I've done it many times on my MapR cluster, mixing 1-3 different versions, as well as setting up the official Apache Spark there.
All you need to do is tweak conf/spark-env.sh (renamed from spark-env.sh.template) in each distribution and add a line:
export SPARK_HOME=/your/location/of/spark/spark-2.1.0

Installing a spark cluster on a hadoop cluster

I am trying to install an Apache Spark cluster on a Hadoop cluster.
I am looking for best practices in this regard. I am assuming that the Spark master needs to be installed on the same machine as the Hadoop namenode, and the Spark slaves on the Hadoop datanodes. Also, where do I need to install Scala? Please advise.
If your Hadoop cluster is running YARN, just use yarn mode for submitting your applications. That's the easiest method, since it doesn't require you to install anything beyond downloading the Apache Spark distribution to a client machine. One additional thing you can do is deploy the Spark assembly to HDFS so that you can use the spark.yarn.jar config when you call spark-submit; that way the JAR is cached on the nodes.
See here for all the details: http://spark.apache.org/docs/latest/running-on-yarn.html
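As a rough sketch of what that looks like from the client side, here is a minimal job submitted in yarn mode; the HDFS path for spark.yarn.jar and the file names are assumptions, not values from the question:

import org.apache.spark.{SparkConf, SparkContext}

object YarnSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("yarn-smoke-test")
      // Optional: point at a Spark assembly previously uploaded to HDFS so
      // YARN can cache it on the nodes (the path is a placeholder).
      .set("spark.yarn.jar", "hdfs:///user/spark/share/spark-assembly.jar")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).sum())   // trivial sanity check
    sc.stop()
  }
}

Submitted from the client machine with something like:
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --class YarnSmokeTest yarn-smoke-test.jar

Note that nothing needs to be installed on the Hadoop nodes themselves, and Scala only needs to be on the machine where you build the JAR; the Spark distribution already ships the Scala runtime it needs.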

How to run a Kafka connect worker in YARN?

I'm playing with Kafka-Connect. I've got the HDFS connector working both in stand-alone mode and distributed mode.
They advertise that the workers (which are responsible for running the connectors) can be managed via YARN. However, I haven't seen any documentation that describes how to achieve this.
How do I go about getting YARN to execute the workers? If there is no specific approach, is there a generic how-to for getting an application to run within YARN?
I've used YARN with Spark via spark-submit; however, I cannot figure out how to get the connector to run in YARN.
You can theoretically run anything on YARN, even a simple hello world program. Which is why saying Kafka-Connect runs on YARN is technically correct. The caveat is that getting Kafka-Connect to run on YARN will take a fair amount of elbow grease at the moment. There are two ways to do it:
Directly talk to the YARN API to acquire a container, deploy the Kafka-Connect binaries, and launch Kafka-Connect (sketched below).
Use the separate Slider project https://slider.incubator.apache.org/docs/getting_started.html that Stephane has already mentioned in the comments.
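For a sense of what the first option involves, here is a very rough sketch using the Hadoop YARN client API. It only shows the mechanics of submitting an application; a real deployment would also need an ApplicationMaster that registers with the ResourceManager, requests containers, localizes the Kafka-Connect binaries, and launches connect-distributed.sh in them. The class names and resource sizes here are hypothetical.

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, Resource}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object ConnectOnYarnSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    val app = yarnClient.createApplication()

    // Command for the ApplicationMaster container; com.example.ConnectAppMaster
    // is a hypothetical class you would have to write yourself.
    val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
    amContainer.setCommands(List("$JAVA_HOME/bin/java com.example.ConnectAppMaster").asJava)

    val resource = Records.newRecord(classOf[Resource])
    resource.setMemory(2048)      // MB for the AM container
    resource.setVirtualCores(1)

    val appContext = app.getApplicationSubmissionContext
    appContext.setApplicationName("kafka-connect-worker")
    appContext.setAMContainerSpec(amContainer)
    appContext.setResource(resource)

    yarnClient.submitApplication(appContext)
  }
}

Most people reasonably decide this is not worth the effort, which is where Slider comes in.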
Slider
You'll have to read quite a bit of documentation to get it working, but the idea behind Slider is that you can get any program to run on YARN without dealing with the YARN API or writing a YARN application master, by doing the following:
Create a Slider package out of your program
Define a configuration for your package
Use the Slider CLI to deploy your application onto YARN
Slider handles container deployment and recovery of failed containers for you, which is nice. Slider is also set to become a native part of YARN when YARN 3.0 is released.
Alternatives
Also as a side note, getting Kafka-Connect to deploy on Kubernetes or Mesos / Marathon is probably going to be easier. The basic workflow to do that would be:
Create a Kafka-Connect Docker container, or just use Confluent's Docker container
Create a deployment config for Kubernetes or Marathon
Click a button / run a command
Tutorials
A good Mesos / Marathon tutorial can be found here
Kubernetes tutorial here
Confluent Kubernetes Helm Charts here

How to create a Spark Streaming jar that would work in AWS EMR?

I've been developing a Spark Streaming application with Eclipse, and I'm using sbt to run it locally.
Now I want to deploy the application on AWS using a JAR, but when I use sbt's package command it creates a JAR without the dependencies, so when I upload it to AWS it won't work because Scala is missing.
Is there a way to create an uber-jar with sbt? Am I doing something wrong with the deployment of Spark on AWS?
To create an uber-jar with sbt, use the sbt-assembly plugin. For more details about creating an uber-jar using sbt-assembly, refer to the blog post.
After creating it, you can run the assembly JAR with the java -jar command.
But from Spark 1.0.0 onwards, the spark-submit script in Spark's bin directory is used to launch applications on a cluster; for more details refer here.
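A minimal sketch of the sbt-assembly setup; the project name and version numbers are chosen for illustration, so check the plugin's page for current releases:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
name := "streaming-app"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Mark Spark as "provided" so it is not bundled into the uber-jar;
  // the cluster (spark-submit / EMR) supplies it at runtime.
  "org.apache.spark" %% "spark-core"      % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.3" % "provided"
)

// Resolve the usual META-INF conflicts when merging dependency JARs.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Running sbt assembly then produces a single JAR under target/scala-2.11/ that bundles your code, its dependencies, and the Scala library, which you can pass to spark-submit.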
You should really be following Running Spark on EC2, which reads:
The spark-ec2 script, located in Spark's ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Spark, Shark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you've already signed up for an EC2 account on the Amazon Web Services site.
I've only partially followed the document so I can't comment on how well it's written.
Moreover, according to the Shipping Code to the Cluster chapter in the other document:
The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor, which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to worker nodes. You can also dynamically add new files to be sent to executors with SparkContext.addJar and addFile.
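In code, that quoted mechanism looks roughly like the following; the app name and JAR paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ShippingCodeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shipping-code-example")
      // JARs listed here are shipped to the worker nodes with the application.
      .setJars(Seq("target/scala-2.11/streaming-app-assembly-0.1.jar"))

    val sc = new SparkContext(conf)

    // JARs and plain files can also be added after the context is created.
    sc.addJar("target/scala-2.11/extra-helpers.jar")
    sc.addFile("config/app.properties")

    sc.stop()
  }
}

If you build an uber-jar with sbt-assembly as described in the other answer, that single assembly JAR is usually the only one you need to list.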