Why "java.lang.ClassNotFoundException: Failed to find data source: kinesis" with spark-streaming-kinesis-asl dependency? - scala

My setup:
Scala: 2.11.8
Spark: 2.3.0.cloudera4
I have already added this to my pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
<version>2.3.0</version>
</dependency>
However, when I run my spark-streaming code to consume data from kinesis, it returns:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kinesis.
I got a similar error when consuming data from Kafka and solved it by specifying the dependency in the submit command, but that doesn't seem to work this time:
sudo -u hdfs spark2-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.3.0 --class com.package.newkinesis --master yarn sparktest-1.0-SNAPSHOT.jar
How to address this issue? Any help is appreciated.
My code:
val spark = SparkSession
.builder.master("local[4]")
.appName("SpeedTester")
.config("spark.driver.memory", "3g")
.getOrCreate()
val kinesis = spark.readStream
.format("kinesis")
.option("streamName", kinesisStreamName)
.option("endpointUrl", kinesisEndpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.option("awsAccessKey", awsAccessKeyId)
.option("awsSecretKey", awsSecretKey)
.load()
kinesis.writeStream.format("console").start().awaitTermination()
My full pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.netease</groupId>
<artifactId>sparktest</artifactId>
<version>1.0-SNAPSHOT</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.11.8</scala.version>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<includes>
<include>org/apache/spark/*</include>
</includes>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<scope>provided</scope>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<scope>provided</scope>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<scope>provided</scope>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
<version>2.3.0</version>
</dependency>
</dependencies>
</project>

tl;dr It won't work.
You are using the spark-streaming-kinesis-asl_2.11 dependency, which is for the older DStream-based Spark Streaming API, with the new Spark Structured Streaming API, hence the exception.
You would have to find a compatible Spark Structured Streaming data source for AWS Kinesis; one is not officially provided by the Apache Spark project.
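If you want to stay with spark-streaming-kinesis-asl, it has to be consumed through the DStream API instead of spark.readStream. A rough sketch of that, assuming the stream name and endpoint variables from the question; the region and batch interval are placeholders, and credentials are picked up from the default AWS provider chain:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val conf = new SparkConf().setMaster("local[4]").setAppName("SpeedTester")
val ssc = new StreamingContext(conf, Seconds(10))

// DStream-based Kinesis source shipped with spark-streaming-kinesis-asl
val kinesisStream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName(kinesisStreamName)     // variable from the question
  .endpointUrl(kinesisEndpointUrl)   // variable from the question
  .regionName("us-east-1")           // assumed region, adjust to your stream
  .initialPosition(new KinesisInitialPositions.TrimHorizon())
  .checkpointAppName("SpeedTester")  // DynamoDB table used by the Kinesis Client Library
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()

// records arrive as Array[Byte]
kinesisStream.map(bytes => new String(bytes)).print()

ssc.start()
ssc.awaitTermination()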

Related

Spark java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2

I am currently trying to spark-submit a fat jar, which I developed using Spark 2.4.6 and Scala 2.11.12, to a local cluster. Upon submitting to the cluster, I receive this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
My spark submit command (run in cmd prompt):
spark-submit --class main.app --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 my_app_name-1.0-SNAPSHOT-jar-with-dependencies.jar
Other details:
Scala version: 2.11.12
Spark 2.4.6
When I submit using Spark 3.0.0 (i.e. pointing my SPARK_HOME to the Spark 3.0.0 directory and submitting), it works fine, but when I submit using Spark 2.4.6 (i.e. pointing my SPARK_HOME to the Spark 2.4.6 directory and submitting) I get that error.
I have to use 2.4.6 (this cannot be changed)
My pom file
[....headers and stuff]
<groupId>org.example</groupId>
<artifactId>my_app_name</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<scala.version>2.11.12</scala.version>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
</repositories>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.junit.jupiter/junit-jupiter-api -->
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-api</artifactId>
<version>5.6.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.specs</groupId>
<artifactId>specs</artifactId>
<version>1.2.5</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-avro -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>2.4.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-tools -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-tools</artifactId>
<version>2.4.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.4.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-streams -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.4.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.databricks/spark-csv -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>2.7.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.3.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.3.2</version>
<configuration>
<recompileMode>incremental</recompileMode> <!-- NOTE: incremental compilation although faster requires passing to MAVEN_OPTS="-XX:MaxPermSize=128m" -->
<!-- addScalacArgs>-feature</addScalacArgs -->
<args>
<arg>-Yresolve-term-conflict:object</arg> <!-- required for package/object name conflict in Jenkins jar -->
</args>
<javacArgs>
<javacArg>-Xlint:unchecked</javacArg>
<javacArg>-Xlint:deprecation</javacArg>
</javacArgs>
</configuration>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
ingest_package.object_ingest
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
[....footers and stuff]
My Main App File
package main
import java.nio.file.{Files, Paths}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.to_avro
import org.apache.spark.sql.functions.{date_format, struct}
object app {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("parquet_ingest_engine")
.getOrCreate()
Logger.getLogger("org").setLevel(Level.ERROR)
val accessKeyId = System.getenv("AWS_ACCESS_KEY_ID")
val secretAccessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
val person_df = spark.read.format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").load("s3_parquet_path_here")
val person_df_reformatted = person_df.withColumn("registration_dttm_string", date_format(person_df("registration_dttm"), "MM/dd/yyyy hh:mm"))
val person_df_final = person_df_reformatted.select("registration_dttm_string", "id", "first_name", "last_name", "email", "gender", "ip_address", "cc", "country", "birthdate", "salary", "title", "comments")
person_df_final.printSchema()
person_df_final.show(5)
val person_avro_schema = new String(Files.readAllBytes(Paths.get("input\\person_schema.avsc")))
print(person_avro_schema)
person_df_final.write.format("avro").mode("overwrite").option("avroSchema", person_avro_schema).save("output/person.avro")
print("\n" + "=====================successfully wrote avro to local path=====================" + "\n")
person_df_final.select(to_avro(struct("registration_dttm_string", "id", "first_name", "last_name", "email", "gender", "ip_address", "cc", "country", "birthdate", "salary", "title", "comments")) as "value")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "spark_topic_test")
.save()
print("\n" + "========================Successfully wrote to avro consumer on localhost kafka consumer========================" + "\n"+ "\n")
}
}
First, you have problems with dependencies:
you don't need com.databricks:spark-csv_2.11 - CSV support has been built into Spark itself for a long time
you don't need Kafka dependencies except org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6
spark-sql and spark-core need to be declared with <scope>provided</scope>
it's better to use the same version of Spark dependencies as you're using for submission
Second, the problem could come from an incorrect Scala version (for example, you didn't run mvn clean after changing it). Since you say the code works with Spark 3.0, it was probably compiled with Scala 2.12, while 2.4.6 works only with 2.11.
I strongly recommend getting rid of the unnecessary dependencies, using provided scope, running mvn clean, etc.
I hit the same error and solved it by using a package whose Scala version and Spark version matched my cluster. I see the package you are using (org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6) is consistent with your Spark version; maybe you can try changing the version to a close one (e.g. org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0).
My Spark is "version 2.4.4 using Scala version 2.11.12"; when I read an Avro file using the following package (spark-avro_2.12), I got exactly the same error.
spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.2
It was fixed after changing to "spark-shell --packages com.databricks:spark-avro_2.11:2.4.0".

spark scala maven error: SparkConf does not have a constructor

I have created a Maven project to run a word-count Spark-Scala program. When I create my SparkConf it gives me the error "org.apache.spark.SparkConf does not have a constructor", and similarly for SparkContext
(org.apache.spark.SparkContext has no constructor).
I have imported both SparkContext and SparkConf and used the proper constructor syntax. This could be a Maven issue, but no error related to that pops up.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
object WordCount {
def main(args: Array[String]) {
val cf = new SparkConf().setAppName("WordCount").setMaster("local")
val sc = new SparkContext(cf)
val rawData = sc.textFile("C:/Users/siddharth.shankar/Documents/input.txt")
val words = rawData.flatMap(line => line.split(" "))
val wordCount = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCount.foreach(println)
}
}
Here is my pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.devinline.spark</groupId>
<artifactId>SparkSample2</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>SparkSample Maven Webapp</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-winutils</artifactId>
<version>2.7.1</version>
</dependency>
</dependencies>
<build>
<finalName>SparkSample2</finalName>
</build>
</project>
I don't know what the issue is here, since the same program runs without errors as a regular Spark-Scala (non-Maven) application.
Check whether your Scala version is the same in both cases.
It seems like a version issue. I executed this code with Maven and it works fine with Scala 2.11.
It also works with spark-submit.
Try adding:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
...
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
...
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
...
<plugins>
<plugin>
<!-- see http://davidb.github.com/scala-maven-plugin -->
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.3.1</version>
<configuration>
<compilerPlugins>
<compilerPlugin>
<groupId>com.artima.supersafe</groupId>
<artifactId>supersafe_${scala.version}</artifactId>
<version>1.1.3</version>
</compilerPlugin>
</compilerPlugins>
</configuration>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-feature</arg>
<arg>-deprecation</arg>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
...
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
Using these dependencies, it will work fine.

Unable to execute Hive queries using spark-submit

I am not able to run Hive queries using the spark-submit command, but the same queries execute fine in spark-shell. I am using AWS EMR as the cluster.
Below is my code, written in the Eclipse Scala IDE:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object HiveTest {
def main(args: Array[String]): Unit =
{
val sparkConf = new SparkConf()
sparkConf.setAppName("WordCountTest")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
sqlContext.sql("select * from stream_table");
}
}
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark</groupId>
<artifactId>word-count</artifactId>
<version>0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>word-count</name>
<url>http://maven.apache.org</url>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
HiveTest
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<properties>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.tools.version>2.11</scala.tools.version>
<spark.version>2.0.0</spark.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.90.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
spark-submit command
spark-submit --master local[2] --class HiveTest
./word-count-0.1-SNAPSHOT-jar-with-dependencies.jar
Error
[hadoop#ip-10-134-23-168 jars]$ spark-submit --master local[2] --class HiveTest ./word-count-0.1-SNAPSHOT-jar-with-dependencies.jar
18/02/12 10:58:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/02/12 10:58:49 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:192)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
As the Spark version is 2.0, it's better to use a SparkSession object instead of SparkContext or SQLContext, and you have to create the SparkSession with Hive support enabled.
It runs in spark-shell because the spark session and sc there are created with Hive support enabled.
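A minimal sketch of that approach, reusing the app and table names from the question:
import org.apache.spark.sql.SparkSession

object HiveTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountTest")
      .enableHiveSupport() // lets spark.sql see the Hive metastore
      .getOrCreate()

    spark.sql("select * from stream_table").show()

    spark.stop()
  }
}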
The reason for the failure is the classpath. When I run spark-submit with the fat jar, Spark's default classpath is not being used. Adding <scope>provided</scope> to the relevant dependencies in the POM resolved the issue.
Dependencies with the provided scope are not added to the assembled jar (word-count-0.1-SNAPSHOT-jar-with-dependencies.jar here); they are used only for compilation.
changed POM.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>spark</groupId>
<artifactId>word-count</artifactId>
<version>0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>word-count</name>
<url>http://maven.apache.org</url>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<archive>
<manifest>
<mainClass>
HiveTest
</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<properties>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.tools.version>2.11</scala.tools.version>
<spark.version>2.1.0</spark.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.90.0</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.tools.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>provided</scope>
</dependency>
</dependencies>
</project>
spark-submit command
spark-submit --master local[2] --class HiveWordCountScala
./word-count-0.1-SNAPSHOT-jar-with-dependencies.jar

What is version library spark supported SparkSession

Spark code with SparkSession:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
val conf = SparkSession.builder
.master("local")
.appName("testing")
.enableHiveSupport() // <- enable Hive support.
.getOrCreate()
Code pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cms.spark</groupId>
<artifactId>cms-spark</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>cms-spark</name>
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.8.3</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.5.3</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>install</phase> <!-- bind to the packaging phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
I have a problem: I wrote Spark code with SparkSession, but SparkSession cannot be found in the Spark SQL library, so I can't run the code. My question is: which version of the Spark libraries do I need in order to get SparkSession? My pom.xml is given above.
Thanks.
You need both the core and SQL artifacts:
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
</dependencies>
You need Spark 2.0 to use SparkSession. It's available in the Maven Central snapshot repository as of now:
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.0.0-SNAPSHOT
The same version has to be specified for the other Spark artifacts. Note that 2.0 is still in beta and is expected to become stable in about a month, AFAIK.
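Once a 2.x spark-sql artifact is on the classpath, the snippet from the question can import SparkSession directly; a minimal sketch:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("testing")
  .enableHiveSupport() // also requires the matching spark-hive artifact of the same 2.x version
  .getOrCreate()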
Update. Alternatively, you can use Cloudera fork of Spark 2.0:
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.0.0-cloudera1-SNAPSHOT
Cloudera repository has to be specified in your Maven repositories list:
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

spring data hive integration with hive template

I am trying to use Spring Data Hadoop to integrate Hive into my application and am running into some issues. The first thing I am not sure about is <hdp:hive-server host="some-other-host" port="10001" />: is this meant to connect to an existing Hive server, or does it create a new Hive server that I can then connect to? Secondly, my configuration does not throw any errors, so it seems OK, and even the hiveTemplate autowiring works fine, but when I execute a query I don't seem to get any response back. The application sort of gets stuck at that point.
here is the configuration
<hive-client-factory host="${hive-${env}.server}" port="${hive-${env}.port}" />
<hive-template />
and here is how I'm using it:
log.debug("before hive query");
for(String result : hiveTemplate.query("show tables;")){
log.debug("=> " + result);
}
log.debug("after hive query");
All I see in the log output is "before hive query"; nothing happens after that. I would appreciate any help. Any ideas what I could be doing wrong?
Try 10000 as the port number.
Usually the Thrift server is deployed on port 10000. Check whether your installation is using HiveServer2 or HiveServer. I was able to get Spring Batch workflows to work with HiveServer, but I have not yet succeeded with HiveServer2.
Ensure that HiveServer/HiveServer2 is up and running before you run your program.
If you are using this with Spring, please make sure all your dependencies are compatible with your Spring version. After that, grant read and write permissions using the commands below:
hadoop fs -mkdir /tmp
hadoop fs -chmod a+w /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod a+w /user/hive/warehouse
I have used the following dependency versions in pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<artifactId>spring-hadoop-samples-hive</artifactId>
<name>Spring Hadoop Samples - Hive</name>
<parent>
<groupId>org.springframework.samples</groupId>
<artifactId>spring-hadoop-samples</artifactId>
<version>1.0.0.BUILD-SNAPSHOT</version>
<relativePath>../parent/pom.xml</relativePath>
</parent>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spring.hadoop.version>2.3.0.M1</spring.hadoop.version>
<hadoop.version>2.7.1</hadoop.version>
<hive.version>1.2.1</hive.version>
<!-- <hive.version>2.1.1</hive.version> -->
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.data</groupId>
<artifactId>spring-data-hadoop</artifactId>
<version>${spring.hadoop.version}</version>
<exclusions>
<exclusion>
<groupId>org.springframework</groupId>
<artifactId>spring-context-support</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-tx</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-metastore</artifactId>
<version>${hive.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-service</artifactId>
<version>${hive.version}</version>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>libfb303</artifactId>
<version>0.9.1</version>
</dependency>
<!-- runtime Hive deps start -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-common</artifactId>
<version>${hive.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-shims</artifactId>
<version>${hive.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>${hive.version}</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-contrib</artifactId>
<version>${hive.version}</version>
<scope>runtime</scope>
</dependency>
<!-- runtime Hive deps end -->
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy</artifactId>
<version>1.8.5</version>
<scope>runtime</scope>
</dependency>
</dependencies>
<repositories>
<repository>
<id>spring-milestone</id>
<url>http://repo.spring.io/libs-milestone</url>
</repository>
</repositories>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>appassembler-maven-plugin</artifactId>
<version>1.2.2</version>
<configuration>
<repositoryLayout>flat</repositoryLayout>
<configurationSourceDirectory>src/main/config</configurationSourceDirectory>
<copyConfigurationDirectory>true</copyConfigurationDirectory>
<!-- Extra JVM arguments that will be included in the bin scripts -->
<extraJvmArguments>-Xms512m -Xmx1024m -Dhive.version=${hive.version}</extraJvmArguments>
<programs>
<program>
<mainClass>org.springframework.samples.hadoop.hive.HiveApp</mainClass>
<name>hiveApp</name>
</program>
<program>
<mainClass>org.springframework.samples.hadoop.hive.HiveClientApp</mainClass>
<name>hiveClientApp</name>
</program>
<program>
<mainClass>org.springframework.samples.hadoop.hive.HiveAppWithApacheLogs</mainClass>
<name>hiveAppWithApacheLogs</name>
</program>
</programs>
</configuration>
<executions>
<execution>
<id>package</id>
<goals>
<goal>assemble</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<executions>
<execution>
<id>config</id>
<phase>package</phase>
<configuration>
<tasks>
<copy todir="target/appassembler/data">
<fileset dir="data"/>
</copy>
</tasks>
</configuration>
<goals>
<goal>run</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>