Why does executing a Structured Streaming application fail with "Failed to find data source: kafka"? [duplicate]

This question already has answers here:
Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?
(8 answers)
Closed 4 years ago.
I am trying to connect Spark Structured Streaming with Kafka, and it throws the error below:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at ...
Based on the documentation, I have added the required dependencies,
and my Kafka and ZooKeeper servers are running.
I am not sure what the issue is.
Also, I am using it this way
import spark.implicits._
val feedback = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:2181")
  .option("subscribe", "kafka_input_topic")
  .load()
  .as[InputMessage]
  .filter(_.lang.equals("en"))
Any help is appreciated. Thank you

The problem, as you mentioned in your comments, is this:
<scope>provided</scope>
Remove the provided scope for spark-sql-kafka-0-10, as it is not provided by the Spark installation.
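If your build happens to use sbt instead of Maven, the equivalent is a one-liner (a minimal sketch; the version shown is illustrative and should match your Spark version):
// sbt build definition: the Kafka connector must ship with the application,
// so it is NOT marked "provided" (unlike spark-core / spark-sql).
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"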

Alternatively, you could refer to the Kafka data source by its fully-qualified class name (instead of the alias) as follows:
spark.readStream.format("org.apache.spark.sql.kafka010.KafkaSourceProvider").load
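For context, here is a minimal sketch of the same read using the fully-qualified provider name (the option values are illustrative; note that kafka.bootstrap.servers should point at a Kafka broker, typically port 9092, not at ZooKeeper):
val df = spark.readStream
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "kafka_input_topic")
  .load()
This only works around the alias lookup, though; the spark-sql-kafka classes still have to be on the runtime classpath.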

The issue is that the necessary JAR is not on the CLASSPATH at runtime (not at build time).
Based on the documentation you linked to, you added the required dependencies to your build definition file (pom.xml, build.sbt or build.gradle), but the exception happens when you run the application, which is after it has been built, isn't it?
What you are missing is the part of the documentation about deployment, i.e. Deploying:
As with any Spark applications, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.11 and its dependencies can be directly added to spark-submit using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 ..
You have to add the --packages option, or you have to create an uber-jar that makes the dependency part of your JAR file.
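If you are unsure whether the connector actually made it onto the driver's runtime classpath, a quick diagnostic (just a sketch, run anywhere in the driver code or in spark-shell) is to load the provider class directly:
// Throws ClassNotFoundException if spark-sql-kafka-0-10 is missing at runtime;
// returns the Class object silently if the JAR is on the classpath.
Class.forName("org.apache.spark.sql.kafka010.KafkaSourceProvider")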

If you are using Maven, then the following way of building a JAR with dependencies might solve your issue.
Add the Spark dependencies like below:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.1</version>
<scope>${spark.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.2.1</version>
</dependency>
Then configure your maven profiles as below:
<profiles>
<profile>
<id>default</id>
<properties>
<profile.id>dev</profile.id>
<spark.scope>compile</spark.scope>
</properties>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
</profile>
<profile>
<id>test</id>
<properties>
<profile.id>test</profile.id>
<spark.scope>provided</spark.scope>
</properties>
</profile>
<profile>
<id>online</id>
<properties>
<profile.id>online</profile.id>
<spark.scope>provided</spark.scope>
</properties>
</profile>
</profiles>
Add the following plugin:
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- bind to the packaging phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
Then build your JAR using mvn clean install -Ponline -DskipTests. This should solve your issue.
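The idea behind the profiles is that spark-core can be provided by the cluster (so it is excluded from the uber-jar in the test/online builds), while spark-sql-kafka is always bundled. If you happen to be on sbt instead, a rough equivalent sketch (the spark.scope property name is just an illustration) would be:
// Switch Spark core between "compile" (local runs) and "provided" (cluster builds)
// via a JVM system property, e.g. sbt -Dspark.scope=provided assembly
val sparkScope = sys.props.getOrElse("spark.scope", "compile")
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"           % "2.2.1" % sparkScope,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.1"  // always bundled
)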

Related

Spark Maven dependency incompatibility between delta-core and spark-avro

I'm trying to add delta-core to my Scala Spark project, running Spark 2.4.4.
A weird behaviour I'm seeing is that it seems to conflict with spark-avro. The Maven build succeeds, but at runtime I get errors.
If the delta-core dependency is declared first, I get a runtime error that spark-avro is not installed:
User class threw exception: org.apache.spark.sql.AnalysisException:
Failed to find data source: avro. Avro is built-in but external data
source module since Spark 2.4. Please deploy the application as per
the deployment section of "Apache Avro Data Source Guide".;
<dependencies>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.11</artifactId>
<version>0.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.4</version>
</dependency>
</dependencies>
If spark-avro is declared first, Avro works, but Delta throws an exception:
User class threw exception: java.lang.ClassNotFoundException: Failed
to find data source: delta. Please find packages at
http://spark.apache.org/third-party-projects.html
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.11</artifactId>
<version>0.6.1</version>
</dependency>
</dependencies>
I thought this could be some kind of dependency conflict so I tried:
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
on both, but it didn't help.
I got an answer for this on the delta-core issues page. Thanks zsxwing!
The complete solution is based on this previous Stack Overflow answer: merge the service files under META-INF so the different Spark data sources won't override each other.
The complete solution: I changed the maven-assembly-plugin configuration to this:
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<descriptors>
<descriptor>${project.basedir}\src\assembly\tvm_assembly.xml</descriptor>
</descriptors>
</configuration>
</plugin>
And in the new tvm_assembly.xml file (based on the original jar-with-dependencies descriptor, with the handlers added for merging):
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.1.0 http://maven.apache.org/xsd/assembly-2.1.0.xsd">
<!-- TODO: a jarjar format would be better -->
<id>jar-with-dependencies</id>
<formats>
<format>jar</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<dependencySets>
<dependencySet>
<outputDirectory>/</outputDirectory>
<useProjectArtifact>true</useProjectArtifact>
<unpack>true</unpack>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
<containerDescriptorHandlers>
<containerDescriptorHandler>
<handlerName>metaInf-services</handlerName>
</containerDescriptorHandler>
<containerDescriptorHandler>
<handlerName>metaInf-spring</handlerName>
</containerDescriptorHandler>
<containerDescriptorHandler>
<handlerName>plexus</handlerName>
</containerDescriptorHandler>
</containerDescriptorHandlers>
</assembly>
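To sanity-check the merged jar, a small sketch like the following (purely diagnostic; the service-file path is the one Spark 2.4 data sources register under) prints the DataSourceRegister entries visible on the classpath; after a correct merge, both the Avro and the Delta providers should appear:
import scala.collection.JavaConverters._
import scala.io.Source

// Every JAR contributing data sources registers its provider class in this service
// file; if the assembly overwrote it instead of merging, one source will be missing.
val urls = getClass.getClassLoader
  .getResources("META-INF/services/org.apache.spark.sql.sources.DataSourceRegister")
  .asScala.toList
urls.foreach(url => println(Source.fromURL(url).mkString))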

Exception while building Scala-Maven project on IntelliJ

I am trying to build a Scala-Maven project in IntelliJ IDEA. Right after creating the project, it says the build was successful. This is how my pom.xml looks:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dbloads.pgms</groupId>
<artifactId>Arts</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.7.0</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.4.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.4.0</version>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
Next I tried to add the compiler options to the project (screenshot omitted).
Once added, when I click the Run button, I get the error message below:
[ERROR] Plugin net.alchim31.maven:scala-maven-plugin:3.4.0 or one of its dependencies could not be resolved: Failed to read artifact descriptor for net.alchim31.maven:scala-maven-plugin:jar:3.4.0: Could not transfer artifact net.alchim31.maven:scala-maven-plugin:pom:3.4.0 from/to central (https://repo.maven.apache.org/maven2): PITC-Zscaler-EMEA-Amsterdam3PR.proxy.corporate.ge.com: Unknown host PITC-Zscaler-EMEA-Amsterdam3PR.proxy.corporate.ge.com -> [Help 1]
I am using JDK version 1.8.0_172, and I added the Scala plugin directly from the plugins menu, so it is the latest version of Scala.
Could anyone let me know how I can fix this problem?
It looks like you need to configure Maven and IntelliJ to use network proxy settings, since you appear to be behind a corporate firewall.
Maven has the ability to configure a proxy through its settings (in ~/.m2/settings.xml on Unix-like systems, or %HOME%\.m2\settings.xml on Windows), as follows:
<settings ...>
.
.
<proxies>
<!-- You can have one of these for each possible proxy. -->
<proxy>
<active>true</active>
<!-- Pick some ID for your proxy here. -->
<id>corp-proxy</id>
<!-- Choose your protocol here. E.g. "http", "socks4" or "socks5" -->
<protocol>http</protocol>
<!-- Specify the proxy server name (or IP address) and port of your proxy here. -->
<host>proxy.example.com</host>
<port>8080</port>
<!-- Identify any hosts here that you can access directly. It's unlikely that you'll
need this unless you have a proxy repository (such as Nexus, Artifactory, etc.) on
your corporate network. -->
<nonProxyHosts>www.google.com|*.example.com</nonProxyHosts>
<!-- The following fields are only necessary if required by your proxy. If you need to
enter your own username and password, make sure you do not add this file to version
control! -->
<username>proxyuser</username>
<password>somepassword</password>
</proxy>
</proxies>
.
.
</settings>
Meanwhile, IntelliJ is configured to use proxies through its settings. Refer to this answer for further details. (Note that setting proxy information via the JAVA_OPTS environment variable will work for running any Java/Scala/JVM application that needs to access the Internet via a proxy.)
Alternatively, if your proxy settings are configured correctly or are not required, it might be a temporary network connection issue, so make sure you have an Internet connection and try again. (The exception is a failure by Maven to download the plugin from the Maven central repository.)
BTW, the version of Scala you have specified—2.7.0—is ancient and almost certainly will not work with JDK 8. Either use an older JDK or a more recent version of Scala (the current release is 2.12.6).
Note that if you need to work with the current version of Apache Spark, you must currently use Scala 2.11.x - the most recent release being 2.11.12.
UPDATE:
From your comments, it seems there is some confusion about the roles played by Maven, the scala-maven-plugin, IntelliJ and the IntelliJ Scala plugin, so I'll quickly summarize them here. Please bear with me if I cover topics you're already familiar with...
Maven is a system for building and publishing software. (It actually does a lot more than just that, which is why Maven describes itself as project management software.) It allows developers to specify, in a single place, all of their software's dependencies (third-party libraries used by the software), which Maven downloads as required from the Central Maven Repository (mostly open-source) or from other, private repositories. Further settings control how compilers are configured, tests are run, reports are generated, etc.
Maven was developed primarily for development of Java-language projects. The scala-maven-plugin provides support for the Scala language and compiler within Maven. It is this plugin that downloads the version of the Scala compiler specified by your project and uses it to compile and build your sources.
If you look at your Maven project's pom.xml file, you will notice the following lines in the build section:
<project ...>
...
<build>
...
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.4.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
...
</build>
...
</project>
and again in the reporting section:
<project ...>
...
<reporting>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.4.0</version>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
In both cases, there is a line that reads <scalaVersion>${scala.version}</scalaVersion>, which tells Maven which version of Scala you want to use. The plugin then uses this version of the Scala compiler, and has Maven download it to a cached, local repository on your machine (typically, in C:\Users\{your account}\.m2 on a Windows machine). Note that both Maven and the scala-maven-plugin will ignore any versions of Scala you have installed on your machine.
So which version of Scala is the plugin going to download for you? The value ${scala.version} states that the version number is stored as the value of a property named scala.version. Your pom.xml file also has lines, near the top, that create this property:
<project ...>
...
<properties>
<scala.version>2.7.0</scala.version>
</properties>
...
</project>
So, you can see that you will use version 2.7.0 of the Scala compiler. If you want to use the latest Scala version, you should change this to:
<project ...>
...
<properties>
<scala.version>2.12.6</scala.version>
</properties>
...
</project>
OK, so now you will be using the latest version of the Scala compiler. Now let's move on to IntelliJ...
IntelliJ IDEA is an Integrated Development Environment (IDE), primarily aimed at development with the Java language. It provides syntax highlighting, smart code completion, and other features to simplify the process of writing code. In order to provide those features for the Scala programming language, you need to install its Scala plugin.
You can configure IntelliJ to use any version of Scala that you have installed on your machine. IntelliJ will then know how to compile, build and run your software and can work without using your Maven project object model (POM) file's build definition.
However, one of the reasons for using Maven is to ensure a consistent build environment for developing a project, so that it is not at the whim of whatever each developer may or may not have installed on their machine. For this reason, if a project uses Maven, it's a good idea to tell IntelliJ. That way, IntelliJ can use Maven's pom.xml file to specify the version of the compiler, download dependencies, configure the compiler settings, etc.
So, the above information should help you to get up-and-running with your project, working with your corporation's network proxy and using the latest version of Scala, using Maven and IntelliJ.

How to debug Dataflow/Apache Beam pipeline DoFn functions in eclipse using direct runner

I want to run my pipeline using the direct runner in Eclipse, put breakpoints in my DoFn functions, and debug the execution. I tried to set up the direct runner with the following steps:
Add the direct runner Maven package.
Set up a Maven profile for the direct runner in pom.xml. My pom.xml has this profile:
<profiles>
<profile>
<id>direct-runner</id>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-direct-java</artifactId>
<version>0.2.0-incubating</version>
</dependency>
</dependencies>
</profile>
</profiles>
I have this Maven plugin under pluginManagement in my pom.xml:
<pluginManagement>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.4.0</version>
<executions>
<execution>
<goals>
<goal>java</goal>
</goals>
</execution>
</executions>
<configuration>
<cleanupDaemonThreads>false</cleanupDaemonThreads>
<mainClass>com.MyMainClass</mainClass>
</configuration>
</plugin>
</plugins>
</pluginManagement>
Below is a screenshot of my Eclipse debug configuration (omitted here).
When I run using that debug configuration, the job starts on GCP Dataflow instead of in local JVM threads, and my breakpoints are never hit.
The problem is probably the way you are creating your pipeline in your test methods. Try to create the pipeline using the TestPipeline utility class, like this:
public TestPipeline p = TestPipeline.create();
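If the job still launches on Dataflow, it usually means the runner option is still set to the Dataflow runner somewhere (run configuration arguments or code). As an additional illustration (a sketch only, shown in Scala against the Beam Java API; class names assume a reasonably recent Beam release), the runner can be pinned programmatically so the pipeline executes in the local JVM where breakpoints are hit:
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.runners.direct.DirectRunner

// Force the direct runner so DoFn code runs (and can be debugged) locally.
val options = PipelineOptionsFactory.create()
options.setRunner(classOf[DirectRunner])
val pipeline = Pipeline.create(options)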

why am I able to create wsdl when I took away pluginManagement?

My entire pom.xml is below. With this pom I get this error in Eclipse "Plugin execution not covered by lifecycle configuration: org.apache.cxf:cxf-java2ws-plugin:3.1.8:java2ws (execution: process-classes, phase: process-classes)".
Nevertheless, it does work properly: if I run "mvn clean package install", I get the desired WSDL output file.
If I add pluginManagement, the error in Eclipse disappears, but I get neither the desired WSDL file nor any error in my console. The two closest discussions I found about it were "Publishing wsdl java M2E plugin execution not covered" and "How to solve "Plugin execution not covered by lifecycle configuration" for Spring Data Maven Builds", but I didn't understand them. As far as I can see, the idea is to take advantage of
"<lifecycleMappingMetadata>...<action><execute/>".
My straight question is: why does my pom below work when I take away pluginManagement? I guess, though I am not sure, that I am missing some basic knowledge about the relationship between pluginManagement and execution. The most relevant part of my question is not what is wrong with Eclipse (I found a few people saying to ignore it).
I have been using pluginManagement for a while, but I have never wondered exactly what extra features it adds to my pom. Since it is now failing with java2ws, I am really interested to understand whether there is any extra configuration I should add to my pom in order to get it up and running with pluginManagement and the java2ws goal.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>grp</groupId>
<artifactId>art</artifactId>
<packaging>war</packaging>
<version>0.0.1-SNAPSHOT</version>
<name>art Maven Webapp</name>
<url>http://maven.apache.org</url>
<properties>
<jdk.version>1.8</jdk.version>
<cxf.version>3.1.8</cxf.version>
<spring.version>4.3.4.RELEASE</spring.version>
<!-- <maven.compiler.source>1.8</maven.compiler.source> <maven.compiler.target>1.8</maven.compiler.target> -->
</properties>
<dependencies>
<!-- Spring dependencies -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-web</artifactId>
<version>${spring.version}</version>
</dependency>
<!-- Apache cxf dependencies -->
<dependency>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-rt-frontend-jaxws</artifactId>
<version>${cxf.version}</version>
</dependency>
<dependency>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-rt-transports-http</artifactId>
<version>${cxf.version}</version>
</dependency>
<!-- servlet & jsp -->
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.0.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>javax.servlet.jsp</groupId>
<artifactId>javax.servlet.jsp-api</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>jstl</artifactId>
<version>1.1.2</version>
</dependency>
</dependencies>
<build>
<finalName>art</finalName>
<!-- <pluginManagement> -->
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>${jdk.version}</source>
<target>${jdk.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.cxf</groupId>
<artifactId>cxf-java2ws-plugin</artifactId>
<version>${cxf.version}</version>
<executions>
<execution>
<id>process-classes</id>
<phase>process-classes</phase>
<configuration>
<className>art.VmxService</className>
<outputFile>${project.basedir}/src/main/resources/VmxService.wsdl</outputFile>
<genWsdl>true</genWsdl>
<verbose>true</verbose>
<address>http://localhost:9080/art/VmxService</address>
</configuration>
<goals>
<goal>java2ws</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
<!-- </pluginManagement> -->
</build>
</project>
The pluginManagement section serves a similar purpose to the dependencyManagement section: it defines plugins, together with their version and configuration defaults, without actually adding them to the Maven build lifecycle.
Once the plugin is added to a module's build, it will pick up the configuration from the pluginManagement section.
Also see: Maven: What is pluginManagement?
So if a similar configuration of the same plugin is used in multiple modules, you can collect it in one place. If the plugin is only used in one module, I prefer to just put it directly in the build. But both ways work.
Remember you also need to add the plugin to build.plugins: simply having it in pluginManagement does nothing.
The warning in Eclipse relates more to the lifecycle of your IDE. It differs a bit from the Maven lifecycle, and in some cases the IDE cannot detect (or could not?) at what moment a plugin is supposed to run. Some plugins also cannot execute without a Maven project. So I'm never sure what that lifecycle-mapping plugin tries to solve :/
Anyway: if you generate the classes using a Maven build and this works for you (even though it does not happen when telling Eclipse to 'build' the project without Maven), you're good.
I thought that information (the lifecycle mapping) is nowadays baked into the plugins directly and read by the m2eclipse plugin; I've seen such XML files in some plugins. So the lifecycle-mapping plugin might not be required anymore at all.

How to pre-package external libraries when using Spark on a Mesos cluster

According to the Spark on Mesos docs, one needs to set spark.executor.uri to point to a Spark distribution:
val conf = new SparkConf()
.setMaster("mesos://HOST:5050")
.setAppName("My app")
.set("spark.executor.uri", "<path to spark-1.4.1.tar.gz uploaded above>")
The docs also note that one can build a custom version of the Spark distribution.
My question now is whether it is possible/desirable to pre-package external libraries such as
spark-streaming-kafka
elasticsearch-spark
spark-csv
which will be used in nearly all of the job JARs I'll submit via spark-submit, in order to
reduce the time sbt assembly needs to package the fat jars
reduce the size of the fat jars which need to be submitted
If so, how can this be achieved? Generally speaking, are there any hints on how the fat-jar generation during the job-submission process can be sped up?
The background is that I want to run some code generation for Spark jobs, submit them right away, and show the results asynchronously in a browser frontend. The frontend part shouldn't be too complicated, but I wonder how the backend part can be achieved.
Create a sample Maven project with all your dependencies and then use the maven-shade-plugin. It will create one shaded JAR in your target folder.
Here is a sample pom:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId>
<artifactId>test</artifactId>
<version>0.0.1</version>
<properties>
<java.version>1.7</java.version>
<hadoop.version>2.4.1</hadoop.version>
<spark.version>1.4.0</spark.version>
<version.spark-csv_2.10>1.1.0</version.spark-csv_2.10>
<version.spark-avro_2.10>1.0.0</version.spark-avro_2.10>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<!-- <minimizeJar>true</minimizeJar> -->
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
<exclude>org/bdbizviz/**</exclude>
</excludes>
</filter>
</filters>
<finalName>spark-${project.version}</finalName>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency> <!-- Hadoop dependency -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<artifactId>servlet-api</artifactId>
<groupId>javax.servlet</groupId>
</exclusion>
<exclusion>
<artifactId>guava</artifactId>
<groupId>com.google.guava</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.4</version>
</dependency>
<dependency> <!-- Spark Core -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Spark SQL -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Spark CSV -->
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>${version.spark-csv_2.10}</version>
</dependency>
<dependency> <!-- Spark Avro -->
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>${version.spark-avro_2.10}</version>
</dependency>
<dependency> <!-- Spark Hive -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency> <!-- Spark Hive thriftserver -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive-thriftserver_2.10</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
</project>
When you say pre-package, do you really mean distribute to all the slaves and set up the jobs to use those packages, so that you don't need to download them every time? That might be an option; however, it sounds a bit cumbersome, because distributing everything to the slaves and keeping all the packages up to date is not an easy task.
How about breaking your .tar.gz into smaller pieces, so that instead of a single fat file your jobs fetch several smaller files? In this case it should be possible to leverage the Mesos Fetcher Cache. You'll see bad performance when the agent cache is cold, but once it warms up (i.e. once one job runs and downloads the common files locally), subsequent jobs will complete faster.
Yeah, you can copy the dependencies out to the workers and put them in the system-wide JVM lib directory in order to get them on the classpath.
Then you can mark those dependencies as provided in your sbt build, and they won't be included in the assembly. This does speed up assembly and transfer time.
I haven't tried this on Mesos specifically, but I have used it on Spark standalone for things that are in every job and rarely change.
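For illustration, a minimal sbt sketch of that approach (the coordinates and versions are only examples; use whatever libraries you actually pre-installed on the workers):
// Libraries already present on every worker are marked "provided",
// so sbt-assembly leaves them out of the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1" % "provided",
  "com.databricks"   %% "spark-csv"             % "1.1.0" % "provided"
)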
After I discovered the Spark JobServer project, I decided that it is the most suitable option for my use case.
It supports dynamic context creation via a REST API, as well as adding JARs to the newly created context manually or programmatically. It is also capable of running low-latency synchronous jobs, which is exactly what I need.
I created a Dockerfile so you can try it out with the most recent (supported) versions of Spark (1.4.1), Spark JobServer (0.6.0) and built-in Mesos support (0.24.1):
https://github.com/tobilg/docker-spark-jobserver
https://hub.docker.com/r/tobilg/spark-jobserver/
References:
https://github.com/spark-jobserver/spark-jobserver#features
https://github.com/spark-jobserver/spark-jobserver#context-configuration