sbt unresolved dependency for spark streaming Kafka integration - scala

I want to work with the Kafka integration for Spark Streaming. I use Spark version 2.0.0.
But I get an unresolved dependency error ("unresolved dependency: org.apache.spark#spark-sql-kafka-0-10_2.11;2.0.0: not found").
How can I access this package? Or am I doing something wrong/missing?
My build.sbt file:
name := "Spark Streaming"
version := "0.1"
scalaVersion := "2.11.11"
val sparkVersion = "2.0.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.0-preview"
Thank you for your help.

The artifact you need for the DStream-based Kafka integration is spark-streaming-kafka-0-10 (spark-sql-kafka-0-10, the Structured Streaming source, was not published for Spark 2.0.0, hence the "not found"):
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.0.0"

Related

Unable to start Spark application with Bahir

I am trying to run a Spark application in Scala to connect to ActiveMQ. I am using Bahir for this purpose: format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). When I use Bahir 2.2 in my build.sbt the application runs fine, but on changing it to Bahir 3.0 or Bahir 4.0 the application does not start, and it gives this error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
How can I fix this? Is there an alternative to Bahir which I can use in my Spark Structured Streaming application to connect to ActiveMQ topics?
EDIT:
My build.sbt:
//For spark
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-mllib" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0",
  "org.apache.spark" %% "spark-hive" % "2.4.0",
  "org.apache.spark" %% "spark-streaming" % "2.4.0",
  "org.apache.spark" %% "spark-graphx" % "2.4.0"
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.4.0"
Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling both of them back to version 2.3.
Here is my build.sbt
name := "sparkTest"
version := "0.1"
scalaVersion := "2.11.11"
//For spark
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" %% "spark-mllib" % "2.3.0",
  "org.apache.spark" %% "spark-sql" % "2.3.0",
  "org.apache.spark" %% "spark-hive" % "2.3.0",
  "org.apache.spark" %% "spark-streaming" % "2.3.0",
  "org.apache.spark" %% "spark-graphx" % "2.3.0"
  // "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3"
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.3.0"

sbt package not adding dependencies

I am trying to build a jar using sbt package.
build.sbt:
name := "Simple Project"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.2"
val connectorVersion = "2.3.0"
val cassandraVersion = "3.11"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  "org.scalaj" %% "scalaj-http" % "2.4.2",
  "com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion
)
sbt package runs successfully but does not add spark-cassandra-connector and scalaj-http to the final jar it creates.
Do I need to add anything?
If you want the jar to contain all your dependencies, you have to use the sbt-assembly plugin:
https://github.com/sbt/sbt-assembly
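A minimal sketch, assuming sbt 1.x (the plugin version is an example; check the project page for the current one). In project/plugins.sbt:

// Adds the assembly task that builds a fat jar.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

Spark builds usually also need a merge strategy for duplicate META-INF entries, e.g. in build.sbt:

assemblyMergeStrategy in assembly := {
  // Drop signature files and other META-INF duplicates.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}

Then run sbt assembly instead of sbt package; the "provided" Spark dependencies stay out of the jar, while spark-cassandra-connector and scalaj-http are bundled in.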

how to avoid a dependency being loaded in spark-streaming and kafka?

I'm trying to get an example of Kafka and Spark Streaming working, and I run into problems when running the process.
This is the exception:
[error] Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.8
This is the build.sbt:
name := "SparkJobs"
version := "1.0"
scalaVersion := "2.11.6"
val sparkVersion = "2.4.1"
val flinkVersion = "1.7.2"
resolvers ++= Seq(
  "Typesafe Releases" at "http://repo.typesafe.com/typesafe/releases/",
  "apache snapshots" at "http://repository.apache.org/snapshots/",
  "confluent.io" at "http://packages.confluent.io/maven/",
  "Maven central" at "http://repo1.maven.org/maven2/"
)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion,
  // "org.apache.flink" %% "flink-connector-kafka-0.10" % flinkVersion,
  "org.apache.kafka" %% "kafka-streams-scala" % "2.2.0"
  // "io.confluent" % "kafka-streams-avro-serde" % "5.2.1"
)
// excludeDependencies ++= Seq(
//   // commons-logging is replaced by jcl-over-slf4j
//   ExclusionRule("jackson-module-scala", "jackson-module-scala")
// )
This is the code
Doing sbt dependencyTree I can see that spark-core_2.11-2.4.1.jar depends on jackson-databind-2.6.7.1, and that it is evicted by version 2.9.8, which suggests a collision between libraries. But spark-core_2.11-2.4.1.jar is not the only one: kafka-streams-scala_2.11:2.2.0 uses jackson-databind 2.9.8, so I don't know which library has to evict jackson-databind-2.9.8. spark-core? kafka-streams-scala? Or both?
How can I avoid Jackson library version 2.9.8 in order to get this task up and running?
I am assuming that I need the jackson-databind-2.6.7 version ...
UPDATE after following the advice. Still not working.
I have removed the kafka-streams-scala dependency, which pulls in Jackson 2.9.8, using this build.sbt:
name := "SparkJobs"
version := "1.0"
scalaVersion := "2.11.6"
val sparkVersion = "2.4.1"
val flinkVersion = "1.7.2"
val kafkaStreamScala = "2.2.0"
resolvers ++= Seq(
  "Typesafe Releases" at "http://repo.typesafe.com/typesafe/releases/",
  "apache snapshots" at "http://repository.apache.org/snapshots/",
  "confluent.io" at "http://packages.confluent.io/maven/",
  "Maven central" at "http://repo1.maven.org/maven2/"
)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
But I got a new exception.
UPDATE 2
Got it, now I understand the second exception: I forgot to call awaitTermination.
Kafka Streams includes Jackson 2.9.8, but you don't need it when using Spark Streaming's Kafka integration, so you should really just remove it.
Similarly, kafka-streams-avro-serde isn't what you want to be using with Spark; you might find AbsaOSS/ABRiS useful instead.
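If kafka-streams-scala has to stay on the classpath, one alternative sketch is pinning Jackson with sbt's dependencyOverrides; the version strings below are assumptions that should be checked against what sbt dependencyTree reports for Spark 2.4.1:

// Force every module to the Jackson versions Spark was built against.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.6.7",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7.1",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1"
)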

Spark dependencies configuration for streaming from Twitter

I am trying to run a Spark application with Twitter streaming. However, I am constantly experiencing problems with dependencies.
When I use the org.apache.bahir spark-streaming-twitter dependency I get this error:
module not found: org.apache.bahir#spark-streaming-twitter;2.0.0
Here is the corresponding build.sbt file:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
  "com.typesafe" % "config" % "1.3.0",
  "org.twitter4j" % "twitter4j-stream" % "4.0.6"
)
But when I use the older streaming dependency I get a ClassNotFoundException: org.apache.spark.Logging error.
Here is the corresponding build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0",
  "org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
  "com.typesafe" % "config" % "1.3.0",
  "org.twitter4j" % "twitter4j-stream" % "4.0.6",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.6.3"
)
In order to run my application, I run the sbt clean and package commands.
So which dependencies should I use, and how should I configure them to run my application?
The Twitter backend was removed from Spark with the 2.0 release (org.apache.spark.Logging, which the 1.6.3 artifact depends on, went away with it, hence the ClassNotFoundException), and the Bahir version you declared doesn't match your Spark version. Finally, Bahir's Twitter connector already comes with the twitter4j-stream dependency (4.0.4 at this moment). Use:
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
  "org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
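For completeness, a minimal sketch of consuming the stream with Bahir's connector; the OAuth values are placeholders you must supply:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Placeholder credentials; twitter4j reads these system properties.
System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY")
System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET")
System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN")
System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET")

val conf = new SparkConf().setAppName("TwitterExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// None: build the OAuth authorization from the system properties above.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()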

build.sbt: how to add spark dependencies

Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter in the build.sbt file below:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
libraryDependencies ++= Seq(
  "org.twitter4j" % "twitter4j-core" % "3.0.3",
  "org.twitter4j" % "twitter4j-stream" % "3.0.3"
)
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
I simply took these libraryDependencies from examples online, so I am not sure which versions, etc. to use.
Can someone please explain to me how I should fix this .sbt file? I spent a couple of hours trying to figure it out but none of the suggestions worked. I installed Scala through Homebrew and I am on version 2.11.8.
All of my errors were about:
Modules were resolved with conflicting cross-version suffixes.
The problem is that you are mixing Scala 2.11 and 2.10 artifacts. You have:
scalaVersion := "2.11.8"
And then:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
Here the 2.10 artifact is required. You are also mixing Spark versions instead of using a consistent version:
// spark 1.6.1
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
// spark 1.4.1
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
// spark 0.9.0-incubating
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
Here is a build.sbt that fixes both problems:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-twitter" % sparkVersion
)
You also don't need to manually add twitter4j dependencies since they are added transitively by spark-streaming-twitter.
It works for me:
name := "spark_local"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.twitter4j" % "twitter4j-core" % "3.0.5",
  "org.twitter4j" % "twitter4j-stream" % "3.0.5",
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql" % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0",
  "org.apache.spark" %% "spark-streaming" % "2.0.0"
)