How to import CrossValidatorModel using sbt for Scala Spark

I have a problem with CrossValidatorModel in a Scala sbt project.
These are my dependencies:
libraryDependencies ++= Seq(
// spark
"org.apache.spark" %% "spark-core" % "2.3.1" % "provided" ,
"org.apache.spark" %% "spark-sql" % "2.3.1" % "provided",
"org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided",
// protobuf
"com.thesamet.scalapb" %% "scalapb-runtime" % scalapbVersion % "protobuf",
//for grpc
"io.grpc" % "grpc-netty" % grpcJavaVersion,
"com.thesamet.scalapb" %% "scalapb-runtime-grpc" % scalapbVersion
)
and this is my code, but when I import CrossValidatorModel it gives me an error like this:
import org.apache.spark.ml.tuning.CrossValidatorModel
(grpc-default-executor-0) java.lang.NoClassDefFoundError: org/apache/spark/ml/tuning/CrossValidatorModel$
java.lang.NoClassDefFoundError: org/apache/spark/ml/tuning/CrossValidatorModel$
at MlModel$lr_model$.<init>(server.scala:28)
at MlModel$lr_model$.<clinit>(server.scala)
at HelloWorldServer$RouteGuideImpl.getLabel(server.scala:77)
I tried using Scala 2.11 and 2.12, and Spark 2.3.1 and 2.1.0, but I get the same error.
Thanks.

Solved by changing this
"org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided",
to
"org.apache.spark" %% "spark-mllib" % "2.3.1",

Related

Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0

I have an sbt project, and I want to write a test with ScalaTest and a shared Spark session. Several weeks ago my project started to fail with an error.
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
.....
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
Here is a very simple test:
import org.apache.spark.sql.QueryTest.checkAnswer
import org.apache.spark.sql.Row
import org.apache.spark.sql.test.SharedSparkSession
class SparkTestSpec extends SharedSparkSession {
import testImplicits._
test("join - join using") {
val df = Seq(1, 2, 3).toDF("int")
checkAnswer(df, Row(1) :: Row(2) :: Row(3) :: Nil)
}
}
And the sbt config:
ThisBuild / scalaVersion := "2.12.10"
val sparkVersion = "3.1.0"
val scalaTestVersion = "3.2.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion % Test,
"org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test,
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-hive" % sparkVersion % Test,
"org.apache.spark" %% "spark-hive" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "tests",
"log4j" % "log4j" % "1.2.17",
"org.slf4j" % "slf4j-log4j12" % "1.7.30",
"org.scalatest" %% "scalatest" % scalaTestVersion % Test,
"org.scalatestplus" %% "scalacheck-1-14" % "3.2.2.0",
)
This is a classic Jackson issue. The error tells you that all your dependencies must agree on a single Jackson version, which is currently not the case.
Usually both Spark and another library pull in Jackson transitively, in different versions.
What you need to do is:
run sbt dependencyTree to identify which library is pulling in Jackson and in which version
define dependencyOverrides to force the same Jackson version for all Jackson modules (which version you pick depends on compatibility with the other libraries that need it), as in the sketch below
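A minimal sketch of such an override, assuming you settle on a Jackson 2.10.x release (the range the error message asks for); the exact version should come from your dependency tree:
val jacksonVersion = "2.10.5"  // assumption: any 2.10.x satisfies the constraint in the error
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % jacksonVersion,
  "com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion,
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % jacksonVersion
)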

Unable to start Spark application with Bahir

I am trying to run a Spark application in Scala that connects to ActiveMQ. I am using Bahir for this, via format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider"). With Bahir 2.2 in my build.sbt the application runs fine, but after changing it to Bahir 3.0 or 4.0 the application does not start and gives this error:
[error] (run-main-0) java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
How do I fix this? Is there an alternative to Bahir that I can use in Spark Structured Streaming to connect to ActiveMQ topics?
EDIT:
my build.sbt
//For spark
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0" ,
"org.apache.spark" %% "spark-mllib" % "2.4.0" ,
"org.apache.spark" %% "spark-sql" % "2.4.0" ,
"org.apache.spark" %% "spark-hive" % "2.4.0" ,
"org.apache.spark" %% "spark-streaming" % "2.4.0" ,
"org.apache.spark" %% "spark-graphx" % "2.4.0",
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.4.0"
Okay, so it seems to be some kind of compatibility issue between Spark 2.4 and Bahir 2.4. I fixed it by rolling both back to version 2.3.
Here is my build.sbt
name := "sparkTest"
version := "0.1"
scalaVersion := "2.11.11"
//For spark
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0" ,
"org.apache.spark" %% "spark-mllib" % "2.3.0" ,
"org.apache.spark" %% "spark-sql" % "2.3.0" ,
"org.apache.spark" %% "spark-hive" % "2.3.0" ,
"org.apache.spark" %% "spark-streaming" % "2.3.0" ,
"org.apache.spark" %% "spark-graphx" % "2.3.0",
// "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
)
//Bahir
libraryDependencies += "org.apache.bahir" %% "spark-sql-streaming-mqtt" % "2.3.0"

Spark dependencies configuration for streaming from Twitter

I am trying to run a Spark application with Twitter streaming. However, I am constantly experiencing problems with dependencies.
When I use the org.apache.bahir spark-streaming-twitter dependency, I get this error:
module not found: org.apache.bahir#spark-streaming-twitter;2.0.0
Here is the corresponding build.sbt file:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6"
)
But when I use the older streaming dependency, I get a ClassNotFoundException: org.apache.spark.Logging error.
Here is the corresponding build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6",
"org.apache.spark" %% "spark-streaming-twitter" % "1.6.3"
)
To run my application, I run the sbt clean and package commands.
So what dependencies should I use, and how should I configure them to run my application?
The Twitter backend was removed from Spark in the 2.0 release, and the Bahir version you declared doesn't match your Spark version. Also, the Bahir Twitter module already comes with the twitter4j-stream dependency (4.0.4 at the moment). Use:
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion
)
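With those dependencies in place, consuming the stream looks roughly like this sketch; it assumes the twitter4j OAuth keys are supplied as system properties (twitter4j.oauth.consumerKey and friends):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("TwitterStreamExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Passing None makes twitter4j read the OAuth credentials from system properties.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()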

How to set up a spark build.sbt file?

I have been trying all day and cannot figure out how to make it work.
I have a common library that will be my core library for Spark.
My build.sbt file is not working:
name := "CommonLib"
version := "0.1"
scalaVersion := "2.12.5"
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
// resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
// resolvers += Resolver.sonatypeRepo("public")
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.hadoop" % "hadoop-common" % "2.7.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
// "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-hive_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-yarn_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"com.github.scopt" %% "scopt" % "3.7.0"
)
//addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
//libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
//libraryDependencies ++= {
// val sparkVer = "2.1.0"
// Seq(
// "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources()
// )
//}
All the commented-out lines are tests I've done, and I don't know what to do anymore.
My goal is to get Spark 2.3 working and to have scopt available too.
For my sbt version, I have 1.1.1 installed.
Thank you.
I think I had two main issues.
Spark is not compatible with Scala 2.12 yet, so moving to 2.11.12 solved one issue.
The second issue is that for the IntelliJ sbt console to pick up changes to build.sbt you either need to kill and restart the console or use the reload command, which I didn't know, so I was not actually using the latest build.sbt file.
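For reference, a minimal build.sbt consistent with that fix might look like the sketch below; the Spark patch version and the "provided" scope are assumptions, adjust them to your deployment:
name := "CommonLib"
version := "0.1"
scalaVersion := "2.11.12"

val sparkVersion = "2.3.0"  // assumption: any 2.3.x built for Scala 2.11

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
  "com.github.scopt" %% "scopt" % "3.7.0"
)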
There's a Giter8 template that should work nicely:
https://github.com/holdenk/sparkProjectTemplate.g8
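The template can be instantiated directly with sbt's Giter8 support, for example:
sbt new holdenk/sparkProjectTemplate.g8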

spark streaming with sbt + cassandra connector dependency issue

Folks,
I am trying to integrate Cassandra with Spark Streaming. Below is the sbt file:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.0").
exclude("org.spark-project.spark", "unused")
)
I added the line below (the error line is noted in the snippet) for Cassandra integration:
val lines = KafkaUtils.createDirectStream[
String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
//Getting errors once I add below line in program
lines.saveToCassandra("test", "test", SomeColumns("key", "value"))
lines.print()
Once I add the above line, I see an error in the IDE, and I see a similar error if I try to package this project from the command prompt.
FYI, I am using the versions below:
scala - 2.11
kafka - kafka_2.11-0.8.2.1
java - 8
cassandra - datastax-community-64bit_2.2.8
Please help to resolve the issue.
As expected, it was a dependency issue, which was resolved by updating the sbt file as below:
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-RC1",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.0",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.0").
exclude("org.spark-project.spark", "unused")
)
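Note that for the snippet in the question to compile, the Cassandra connector's streaming implicits (which add saveToCassandra to DStreams) must be in scope; a minimal sketch of the imports involved:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._ // brings SomeColumns into scope
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams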