How to create a Spark Streaming jar that would work in AWS EMR? - eclipse

I've been developing a Spark Streaming application with Eclipse, and I'm using sbt to run it locally.
Now I want to deploy the application on AWS as a jar, but sbt's package command creates a jar without the dependencies, so when I upload it to AWS it fails because Scala is missing.
Is there a way to create an uber-jar with sbt? Am I doing something wrong with the deployment of Spark on AWS?

To create an uber-jar with sbt, use the sbt-assembly plugin. For more details about creating an uber-jar with sbt-assembly, refer to the blog post.
After creating it, you can run the assembly jar using the java -jar command.
But from Spark 1.0.0 onwards, the spark-submit script in Spark’s bin directory is used to launch applications on a cluster. For more details, refer here.
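A minimal sketch of the sbt-assembly setup mentioned above, assuming sbt 1.x. The plugin version, Scala version, Spark version and project name are illustrative assumptions, not values from the question; pick the ones that match the Spark build on your cluster. Spark itself is marked "provided" because the cluster already supplies it, so only your own code, the Scala library and third-party libraries end up in the uber-jar.

```scala
// --- project/plugins.sbt ---
// Plugin version is an assumption; check the sbt-assembly releases for a current one.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// --- build.sbt ---
name := "streaming-app"      // hypothetical project name
scalaVersion := "2.12.18"    // assumption: same Scala binary version as the cluster's Spark

val sparkVersion = "3.3.2"   // assumption: the Spark version installed on the cluster

libraryDependencies ++= Seq(
  // "provided": needed to compile, but left out of the uber-jar
  // because the cluster's Spark installation supplies these classes.
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)
```

Running sbt assembly then writes the uber-jar under target/scala-2.12/, and that jar is what you hand to spark-submit. If the build reports duplicate files between dependencies, sbt-assembly's assemblyMergeStrategy setting is the place to resolve them.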

You should really be following Running Spark on EC2, which reads:
The spark-ec2 script, located in Spark’s ec2 directory, allows you to
launch, manage and shut down Spark clusters on Amazon EC2. It
automatically sets up Spark, Shark and HDFS on the cluster for you.
This guide describes how to use spark-ec2 to launch clusters, how to
run jobs on them, and how to shut them down. It assumes you’ve already
signed up for an EC2 account on the Amazon Web Services site.
I've only partially followed the document so I can't comment on how well it's written.
Moreover, according to the Shipping Code to the Cluster chapter in the other document:
The recommended way to ship your code to the cluster is to pass it
through SparkContext’s constructor, which takes a list of JAR files
(Java/Scala) or .egg and .zip libraries (Python) to disseminate to
worker nodes. You can also dynamically add new files to be sent to
executors with SparkContext.addJar and addFile.
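The quoted passage can be sketched as follows. This is only an illustration: it uses SparkConf.setJars, the configuration-based equivalent of passing the jar list to the SparkContext constructor, and the object name, application name and jar paths are hypothetical. The master URL is not set here on the assumption that spark-submit supplies it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShipJars {
  // Lists the application jar in the configuration so Spark
  // distributes it to the executors. The path comes from the caller.
  def buildConf(appJar: String): SparkConf =
    new SparkConf()
      .setAppName("ship-jars-example") // hypothetical application name
      .setJars(Seq(appJar))

  def main(args: Array[String]): Unit = {
    // args(0): application jar; args(1): an extra jar needed later (both hypothetical).
    val sc = new SparkContext(buildConf(args(0)))
    try {
      // Dynamically adds another jar for tasks that run from now on.
      sc.addJar(args(1))
    } finally {
      sc.stop()
    }
  }
}
```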

Related

Distribute third-party jar dependency on large-scale spark application

Our Spark application depends on a third-party jar file of about 15 MB. Since we want to deploy the application on a large-scale cluster (~500 workers), we are concerned about distributing this jar file. According to the Apache Spark documentation (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management), the options for distributing the file are HDFS, an HTTP server, the driver's HTTP server, and a local path.
We prefer not to use a local path because it requires copying the jar file into the Spark libs directory on every worker. On the other hand, if we use HDFS or an HTTP server, the Spark workers fetching the jar file may effectively mount a DoS attack against our Spark driver server. So, what is the best way to address this challenge?
If you put the third-party jar in HDFS, why would it affect the Spark driver server?
Each node should fetch the additional jar directly from HDFS, not from the Spark server.
After examining the different proposed methods, we found that the best way to distribute the jar files is to copy them to all nodes (as #egor mentioned, too). Based on our deployment tools, we released a new Spark Ansible role that supports external jars. At the moment, Apache Spark does not provide an optimized solution. If you give a remote URL (HTTP, HTTPS, HDFS, FTP) as an external jar path, Spark workers fetch the jar file every time a new job is submitted. So it is not an optimized solution from a network perspective.
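Once the jar has been copied to every node, the application still has to reference it. A minimal sketch, assuming the deployment tool places the jar at the same path on all nodes: Spark's local: URI scheme tells Spark that the file already exists on each worker, so nothing is fetched over the network. The object name, application name and path below are hypothetical.

```scala
import org.apache.spark.SparkConf

object PreinstalledJar {
  // Hypothetical path where the jar was already copied on every node.
  // The "local:" scheme marks it as present on each worker.
  val jarOnEveryNode = "local:/opt/spark/extra-jars/third-party.jar"

  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("preinstalled-jar-example") // hypothetical application name
      .set("spark.jars", jarOnEveryNode)
}
```

The same URI can be passed on the command line through spark-submit's --jars option instead of being set in code.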

Deploy DataFlow job using Scio

I've started developing my first DataFlow job using Scio, the Scala SDK. The dataflow job will run in streaming mode.
Can anyone advise the best way to deploy this? I have read in the Scio docs they use sbt-pack and then deploy this within a Docker container. I have also read about using DataFlow templates (but not in great detail).
What's best?
As with the Java and Python SDKs, you can run your code directly on Dataflow by using the Dataflow runner and launching it from your computer (or a VM/function).
If you want to package it for reuse, you can create a template.
You can't run a custom container on Dataflow.

scala spark cassandra installation

How many ways are there to run Spark? If I just declare the dependencies in build.sbt, is Spark downloaded so that it simply works?
But if I want to run Spark locally (download the Spark tar file, winutils...), how can I specify in Scala code that I want to run against the local Spark installation and not against the dependencies downloaded by IntelliJ?
In order to connect Spark to Cassandra, do I need a local installation of Spark? I read somewhere that it's not possible to connect from a "programmatic" Spark to a local Cassandra database.
1) Spark runs in a slightly strange way: there is your application (the Spark Driver and Executors) and there is the Resource Manager (Spark Master/Workers, YARN, Mesos or Local).
In your code you can run against the in-process manager (local) by specifying the master as local or local[n]. Local mode requires no installation of Spark, as it is set up automatically in the process you are running. This uses the dependencies you downloaded.
To run against a Spark Master that is running locally, use a spark:// URL that points at your particular local Spark Master instance. Note that this causes executor JVMs to start separately from your application, which requires distributing the application code and dependencies. (Other resource managers have their own identifying URLs.)
2) You do not need a "Resource Manager" to connect to C* from Spark, but this ability is basically for debugging and testing. To do this, use the local master URL. Normal Spark usage should have an external Resource Manager, because without one the system cannot be distributed.
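The two kinds of master URL described above can be sketched like this. The object and application names are hypothetical, and the spark:// address assumes a standalone master running on the same machine on its default port 7077.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrls {
  // In-process mode with 2 worker threads; uses only the sbt dependencies.
  val inProcess = "local[2]"
  // Assumption: a standalone Spark Master on this host, default port.
  val standalone = "spark://localhost:7077"

  def confFor(master: String): SparkConf =
    new SparkConf()
      .setAppName("master-url-example") // hypothetical application name
      .setMaster(master)

  def main(args: Array[String]): Unit = {
    // Swap inProcess for standalone to run against the local Spark Master instead.
    val sc = new SparkContext(confFor(inProcess))
    try {
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"sum = $total")
    } finally {
      sc.stop()
    }
  }
}
```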
For some more Spark Cassandra examples see
https://github.com/datastax/SparkBuildExamples

Spring Task in Spring Cloud Dataflow on PCF can't find java

i have a Spring Cloud Task fat jar that i have successfully deployed to SCDF running on PCF. i have created a definition for it and can therefore run it from the dashboard. fwiw it reads and writes from a database using Spring JDBC.
i'm trying to now set it up to run in a scheduled way and am having issues. i created a stream with a triggertask source and a task-launcher-local sink, and have configured the triggertask URI to point to the fat jar (via http, using a staticfile PCF pushed app).
the dashboard shows the two PCF apps (one for triggertask, one for task-local-launcher) both starting successfully, and it all runs, but the task fails every time with the error:
Caused by: java.io.IOException: Cannot run program "java" (in directory "/home/vcap/tmp/spring-cloud-dataflow-5903184636016162160/Task--582903409-1502669137014/Task--582903409"): error=2, No such file or directory
from what i can tell and surmise, the PCF app running the stream tries to fork and exec a java call, but since java is not in the path for PCF app containers i get the error
am i right? either way, how can i get the Spring Cloud Task (jar) to successfully run?
Spring Cloud Data Flow: Server
1.2.3 (using built spring-cloud-dataflow-server-cloudfoundry-1.2.3.BUILD-SNAPSHOT.jar)
Spring Cloud Data Flow: Shell
1.2.3 (using downloaded spring-cloud-dataflow-shell-1.2.3.RELEASE.jar)
Deployment Environment
PCF v1.11.6 (on Azure)
pcf dev v0.26.0 (on mac)
App Starters
http://bit-dot-ly/1-0-4-GA-stream-applications-rabbit-maven
Logs
link to log
The stream definition is missing from the post. It is possible that you're using the tasklauncher-local sink, which is compatible only when using SCDF's local-server and it will fail with the attached error when running in CF. Please make sure you're using tasklauncher-cloudfoundry sink. This application was added in the latest release of app-starters.
As pointed out in the previous SO thread, it is highly recommended that you use the latest release of app-starters (1.0.4 is at least 10 months old). The latest releases can be found at the project site.

Installing a spark cluster on a hadoop cluster

I am trying to install an apache spark cluster on a hadoop cluster.
I am looking for best practices in this regard. I am assuming that the Spark master needs to be installed on the same machine as the Hadoop namenode and the Spark slaves on the Hadoop datanodes. Also, where do I need to install Scala? Please advise.
If your Hadoop cluster is running YARN, just use yarn mode for submitting your applications. That is the easiest method and requires nothing beyond downloading the Apache Spark distribution to a client machine. One additional thing you can do is deploy the Spark assembly to HDFS and set the spark.yarn.jar config (spark.yarn.jars or spark.yarn.archive in Spark 2.0 and later) when you call spark-submit, so that the JAR is cached on the nodes.
See here for all the details: http://spark.apache.org/docs/latest/running-on-yarn.html
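As a rough sketch of such a YARN submission done programmatically, the following uses SparkLauncher, Spark's API counterpart to the spark-submit script. It assumes Spark 2.0 or later (hence spark.yarn.archive rather than spark.yarn.jar), and a client machine where SPARK_HOME and the Hadoop configuration are already set up. The object name and all paths and class names passed in are hypothetical.

```scala
import org.apache.spark.launcher.SparkLauncher

object YarnSubmit {
  // Prepares a submission to YARN in cluster mode.
  // sparkArchive is an archive of Spark's jars previously uploaded to HDFS,
  // so YARN can cache it on the nodes instead of uploading it for every job.
  def buildLauncher(appJar: String, mainClass: String, sparkArchive: String): SparkLauncher =
    new SparkLauncher()
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource(appJar)
      .setMainClass(mainClass)
      .setConf("spark.yarn.archive", sparkArchive)
}
```

Calling startApplication() on the returned launcher submits the job and returns a handle for monitoring it. Neither a Spark master nor Scala has to be installed on the Hadoop nodes for this.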