Dependency issue with spark-sql-kafka with spark-submit - scala

I have written a simple driver class in Scala that uses spark-sql-kafka for structured streaming. I used Eclipse + Maven to package it into a jar. The relevant part of the pom.xml file is as follows:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.0.2</version>
<scope>provided</scope>
</dependency>
</dependencies>
Then, I submit the resulting jar file to spark-submit using the following command:
spark-submit --properties-file {path}/kafka-streaming-conf --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2 --class TestStreamDriver --master yarn {path}/StructuredStreaming-1.0-SNAPSHOT.jar
kafka-streaming-conf is as follows:
spark.executor.extraJavaOptions -Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080
spark.jars.ivySettings {path}/ivysettings_proxy.xml
ivysettings_proxy.xml file is as follows:
<ivysettings>
<settings defaultResolver="default" />
<credentials host = "proxyName:8080" username = "" passwd = ""/>
<include url="${ivy.default.settings.dir}/ivysettings-public.xml" />
<include url="${ivy.default.settings.dir}/ivysettings-shared.xml" />
<include url="${ivy.default.settings.dir}/ivysettings-local.xml" />
<include url="${ivy.default.settings.dir}/ivysettings-main-chain.xml" />
<include url="${ivy.default.settings.dir}/ivysettings-default-chain.xml"/>
</ivysettings>
I also changed the JAVA_OPTS variable with:
export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080"
When I run spark-submit with the above command, it tries to download from the Maven repository and other URLs and then exits with a Connection timed out error.
How can I make spark-submit download dependencies through a proxy?
Thanks.

What worked for me:
I changed the spark-submit properties file to:
spark.driver.extraJavaOptions -Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080
spark.executor.extraJavaOptions -Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080
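(The same settings can also be passed directly on the spark-submit command line instead of a properties file; a sketch reusing the command from the question:)
spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080" --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=proxyName -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxyName -Dhttps.proxyPort=8080" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2 --class TestStreamDriver --master yarn {path}/StructuredStreaming-1.0-SNAPSHOT.jar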
Setting the driver proxy options this way led to a certificate error.
Then I added the certificate for https://repo.maven.apache.org/maven2/
to the {path}/jdk1.8.0_144\jre\lib\security\cacerts file. (I used a free program called Portecle to add certificates to the cacerts file.)
Since I run spark-submit in yarn mode, I had to copy the new cacerts file to all nodes with:
pscp.pssh -h cluster-hosts ./cacerts {path}/jdk1.8.0_40/jre/lib/security/
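For reference, the same certificate import can be done with the JDK's keytool instead of Portecle; a minimal sketch, assuming the repository certificate was saved as maven-repo.crt (the alias and file name are placeholders):
# import the repository certificate into the JDK trust store (default store password is changeit)
keytool -importcert -trustcacerts -alias maven-repo -file maven-repo.crt -keystore {path}/jdk1.8.0_144/jre/lib/security/cacerts -storepass changeit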

Related

Why does the console keep saying "uber-jar-6.5.5 seems corrupted"?

I was installing my project from IntelliJ to AEM using this command (mvn clean install -PautoInstall) and I keep getting this error:
The JAR/ZIP file (C:\Users....m2\repository\com\adobe\aem\uber-jar\6.5.5\uber-jar-6.5.5.jar) seems corrupted, error: error in opening zip file
I have already tried deleting and re-downloading the uber-jar, but to no avail.
Here is my core pom.xml:
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.startsite</groupId>
<artifactId>startsite</artifactId>
<version>1.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<artifactId>startsite.core</artifactId>
<packaging>bundle</packaging>
<name>Start Site - Core</name>
<description>Core bundle for Start Site</description>
<build>
<plugins>
<plugin>
<groupId>org.apache.sling</groupId>
<artifactId>maven-sling-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.felix</groupId>
<artifactId>maven-bundle-plugin</artifactId>
<extensions>true</extensions>
<executions>
<execution>
<id>bundle-manifest</id>
<phase>process-classes</phase>
<goals>
<goal>manifest</goal>
</goals>
</execution>
</executions>
<configuration>
<instructions>
<!-- Import any version of javax.inject, to allow running on multiple versions of AEM -->
<Import-Package>javax.inject;version=0.0.0,*</Import-Package>
<Sling-Model-Packages>
com.startsite.core
</Sling-Model-Packages>
</instructions>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- OSGi Dependencies -->
<dependency>
<groupId>org.osgi</groupId>
<artifactId>osgi.core</artifactId>
</dependency>
<dependency>
<groupId>org.osgi</groupId>
<artifactId>osgi.cmpn</artifactId>
</dependency>
<dependency>
<groupId>org.osgi</groupId>
<artifactId>osgi.annotation</artifactId>
</dependency>
<!-- Other Dependencies -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>javax.jcr</groupId>
<artifactId>jcr</artifactId>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</dependency>
<dependency>
<groupId>com.adobe.aem</groupId>
<artifactId>uber-jar</artifactId>
<classifier>apis</classifier>
</dependency>
<dependency>
<groupId>org.apache.sling</groupId>
<artifactId>org.apache.sling.models.api</artifactId>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
</dependency>
<dependency>
<groupId>junit-addons</groupId>
<artifactId>junit-addons</artifactId>
</dependency>
<dependency>
<groupId>com.adobe.aem</groupId>
<artifactId>uber-jar</artifactId>
<version>6.5.5</version>
</dependency>
<dependency>
<groupId>com.day.commons</groupId>
<artifactId>day.commons.datasource.poolservice</artifactId>
<version>1.0.10</version>
</dependency>
</dependencies>
Try deleting your local ~/.m2 repository and try again.
Why does the console keep saying "uber-jar-6.5.5 seems corrupted"?
It will be saying that because the file is corrupt ... or not a JAR file at all.
Take a look at what is actually in the (supposed) JAR file:
Use jar -tvf <pathname> to list the index. That should tell you if the file is corrupt, etcetera.
On Linux, use file <pathname> to try to determine what file type it really is.
If it is text, look at it using a text editor.
(My guess is that the file is not downloading properly. My second guess is that what you actually have there is an HTML document containing an error message from the failed download. If this is the case, there should be some clues in the error message as to why the download is failing.)
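Concretely, against the path from the error message (shown here as a Unix-style sketch; adjust the path for Windows):
# list the archive contents; a corrupt file or an HTML error page will fail here
jar -tvf ~/.m2/repository/com/adobe/aem/uber-jar/6.5.5/uber-jar-6.5.5.jar
# check what the file actually is (e.g. "Zip archive data" vs "HTML document")
file ~/.m2/repository/com/adobe/aem/uber-jar/6.5.5/uber-jar-6.5.5.jar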
Take a look at the pom: the uber-jar is declared twice. Remove the first declaration and keep only the one with the version on it. Also, the uber-jar should have provided scope:
<dependency>
<groupId>com.adobe.aem</groupId>
<artifactId>uber-jar</artifactId>
<version>6.5.8</version>
<scope>provided</scope>
</dependency>
Then, delete the uber-jar from your .m2 folder and execute mvn -U clean install -PautoInstall
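As a concrete sketch of that step (the path assumes the default local repository location):
# remove the cached uber-jar so Maven has to fetch a fresh copy
rm -rf ~/.m2/repository/com/adobe/aem/uber-jar/6.5.5
# -U forces Maven to re-check the remote repositories
mvn -U clean install -PautoInstall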
I found an alternative solution to this problem: I used Eclipse Neon instead of my usual Eclipse and IntelliJ setup. Newly created projects now work, except for the one I had previously compiled in IntelliJ and Eclipse.

Spark Streaming Kafka: ClassNotFoundException for ByteArrayDeserializer when run with spark-submit

I'm new to Scala / Spark Streaming, and to StackOverflow, so please excuse my formatting. I have made a Scala app that reads log files from a Kafka stream. It runs fine within the IDE, but I'll be damned if I can get it to run using spark-submit. It always fails with:
ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
The line referenced in the Exception is the load command in this snippet:
val records = spark
  .readStream
  .format("kafka")  // <-- use KafkaSource
  .option("subscribe", kafkaTopic)
  .option("kafka.bootstrap.servers", kafkaBroker)  // 192.168.4.86:9092
  .load()
  .selectExpr("CAST(value AS STRING) AS temp")
  .withColumn("record", deSerUDF($"temp"))
IDE: IntelliJ
Spark: 2.2.1
Scala: 2.11.8
Kafka: kafka_2.11-0.10.0.0
Relevant parts of pom.xml:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>2.2.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.github.scala-incubator.io</groupId>
<artifactId>scala-io-file_2.11</artifactId>
<version>0.4.3-1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.0.0</version>
<!-- version>2.0.0</version -->
</dependency>
Note: I don't think it is related, but I have to use zip -d BroLogSpark.jar "META-INF/*.SF" and zip -d BroLogSpark.jar "META-INF/*.DSA" to get past complaints about the manifest signatures.
My jar file does not include any org.apache.kafka classes. I have seen several posts that strongly suggest I have a mismatch in versions, and I have tried countless permutations of changes to pom.xml and spark-submit. After each change, I confirm that it still runs within the IDE, then try spark-submit again on the same system, as the same user. Below is my most recent attempt, where BroLogSpark.jar is in the current directory and "192.168.4.86:9092 profile" are the input arguments.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.1,org.apache.kafka:kafka-clients:0.10.0.0 BroLogSpark.jar 192.168.4.86:9092 BroFile
Add the below dependency too:
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.0.0</version>
</dependency>
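Alternatively, not from the answer above but a hedged sketch: the "kafka" source and the missing ByteArrayDeserializer class come from spark-sql-kafka-0-10 and its kafka-clients dependency, so passing that package to spark-submit lets it pull the missing classes in at launch:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1 BroLogSpark.jar 192.168.4.86:9092 BroFile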

com.sun.enterprise.admin.remote.RemoteFailureException

I am new to REST web services and I am trying to build a web service using Jersey to upload a file. When deploying, I get this error:
Artifact RestTest:war exploded: java.io.IOException: com.sun.enterprise.admin.remote.RemoteFailureException: Error occurred during deployment: Exception while loading the app : CDI definition failure:WELD-000071: Managed bean with a parameterized bean class must be @Dependent: class org.glassfish.jersey.process.internal.DefaultRespondingContext. Please see server.log for more details.
I created the project using this command:
mvn archetype:generate -DarchetypeArtifactId=jersey-quickstart-webapp \
-DarchetypeGroupId=org.glassfish.jersey.archetypes -DinteractiveMode=false \
-DgroupId=com.test -DartifactId=RestTest -Dpackage=com.test \
-DarchetypeVersion=2.27
My pom.xml looks like:
<dependencies>
<dependency>
<groupId>org.glassfish.jersey.containers</groupId>
<artifactId>jersey-container-servlet-core</artifactId>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.inject</groupId>
<artifactId>jersey-hk2</artifactId>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.media</groupId>
<artifactId>jersey-media-multipart</artifactId>
<scope>provided</scope>
<version>2.0-m05-2</version>
</dependency>
<dependency>
<groupId>org.glassfish.jersey.core</groupId>
<artifactId>jersey-common</artifactId>
<version>2.0-m05-2</version>
</dependency>
</dependencies>
According to this answer, I've changed the scope to 'provided', but it isn't working.
Thanks in advance :)
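For context, note that jersey-media-multipart and jersey-common above are pinned to 2.0-m05-2 while the archetype was generated for Jersey 2.27; a minimal sketch (assuming 2.27 is the intended version) of keeping all Jersey artifacts on one version via the Jersey BOM in dependencyManagement:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.glassfish.jersey</groupId>
<artifactId>jersey-bom</artifactId>
<version>2.27</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>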

The error java.lang.ClassNotFoundException when running spark-submit

I want to run spark-submit for my Scala Spark application. These are the steps I took:
1) Execute Maven clean and package from IntelliJ IDEA to get myTest.jar
2) Execute the following spark-submit command:
spark-submit --name 28 --master local[2] --class org.test.consumer.TestRunner \
/usr/tests/test1/target/myTest.jar \
$arg1 $arg2 $arg3 $arg4 $arg5
This is the TestRunner object that I want to run:
package org.test.consumer

import org.test.consumer.kafka.KafkaConsumer

object TestRunner {

  def main(args: Array[String]) {
    val Array(zkQuorum, group, topic1, topic2, kafkaNumThreads) = args
    val processor = new KafkaConsumer(zkQuorum, group, topic1, topic2)
    processor.run(kafkaNumThreads.toInt)
  }

}
But the spark-submit command fails with the following message:
java.lang.ClassNotFoundException: org.test.consumer.TestRunner
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
I don't really understand why the TestRunner object cannot be found if the package is specified correctly... Does it have something to do with using an object instead of a class?
UPDATE:
The project structure (the folder scala is currently marked as Sources):
/usr/tests/test1
  .idea
  src
    main
      docker
      resources
      scala
        org
          test
            consumer
              kafka
                KafkaConsumer.scala
              TestRunner.scala
    test
  target
  pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.test.abc</groupId>
<artifactId>consumer</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.11</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.sedis</groupId>
<artifactId>sedis_2.11</artifactId>
<version>1.2.2</version>
</dependency>
<dependency>
<groupId>com.lambdaworks</groupId>
<artifactId>jacks_2.11</artifactId>
<version>2.3.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib-local_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>com.github.nscala-time</groupId>
<artifactId>nscala-time_2.11</artifactId>
<version>2.12.0</version>
</dependency>
</dependencies>
</project>
@FiofanS, the problem is in your directory structure.
Maven uses a convention-over-configuration policy. This means that, by default, Maven expects you to follow the set of rules it has defined. For example, it expects you to put all your code in the src/main/java directory (see the Maven Standard Directory Layout). But your code is not in src/main/java; instead, it is in src/main/scala. By default, Maven will not consider src/main/scala a source location.
Although Maven expects you to follow the rules it has defined, it doesn't enforce them; it also provides ways to configure things based on your preference.
In your case, you will have to explicitly instruct Maven to also treat src/main/scala as one of your source locations.
To do this, you will have to use the Maven Build Helper Plugin.
Add the below piece of code within the <project>...</project> tag in your pom.xml
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>1.7</version>
<executions>
<execution>
<id>add-source</id>
<phase>generate-sources</phase>
<goals>
<goal>add-source</goal>
</goals>
<configuration>
<sources>
<source>src/main/scala</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
This should solve your problem.
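A related note, not part of the answer above: the Scala sources also need a Scala compiler bound to the Maven build. A minimal sketch of the commonly used scala-maven-plugin (the version shown is an assumption), added next to the build-helper plugin:
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<!-- compile main and test Scala sources -->
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>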

Spark compiles but unable to run a test program within Intellij?

Spark has compiled within IntelliJ via Maven. I am running one of the test suites. It does launch but fails on a basic Scala library class. What is going on?
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
Note that this is a maven project and the tests run successfully from the command line using mvn test.
(Screenshots were attached showing the Scala library info, the project definition, the module info with the Scala 2.11 dependency, the run configuration, and the result of running.)
UPDATE: I was asked about the pom.xml. It is the pom.xml from Spark for Scala 2.11: https://github.com/apache/spark/blob/master/pom.xml
Here is the relevant snippet:
<profile>
<id>scala-2.11</id>
<activation>
<property><name>scala-2.11</name></property>
</activation>
<properties>
<scala.version>2.11.7</scala.version>
<scala.binary.version>2.11</scala.binary.version>
</properties>
</profile>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-actors</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scalap</artifactId>
<version>${scala.version}</version>
</dependency>
You need to make sure that the Spark lib is on your classpath.
To build it, run
build/mvn -DskipTests clean package
then include '/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar' in your project.
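As a quick sanity check, a sketch (run from the Spark source root) to confirm which scala-library version the Maven build actually resolves, since this error usually means the Scala library on the runtime classpath does not match the one the code was compiled against:
build/mvn dependency:tree -Dincludes=org.scala-lang:scala-library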