Getting an error while running Scala Spark code to list blobs in storage

I am getting the error below while trying to list blobs using the google-cloud-storage library:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
I have tried changing the version of the google-cloud-storage library in build.sbt, but I keep getting the same error.
import com.google.auth.oauth2.GoogleCredentials
import com.google.cloud.storage._
import com.google.cloud.storage.Storage.BlobListOption
val credentials: GoogleCredentials = GoogleCredentials.getApplicationDefault()
val storage: Storage = StorageOptions.newBuilder().setCredentials(credentials).setProjectId(projectId).build().getService()
val blobs = storage.list(bucketName, BlobListOption.currentDirectory(), BlobListOption.prefix(path))
My build.sbt looks like this:
version := "0.1"
scalaVersion := "2.11.8"
logBuffered in Test := false
libraryDependencies ++=
Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.scalatest" %% "scalatest" % "3.0.0" % Test,
"com.typesafe" % "config" % "1.3.1",
"org.scalaj" %% "scalaj-http" % "2.4.0",
"com.google.cloud" % "google-cloud-storage" % "1.78.0"
)
Please help me.

This is happening because Spark ships an older version of the Guava library than the one google-cloud-storage needs, and that older version does not have this Preconditions.checkArgument overload. This leads to the java.lang.NoSuchMethodError exception.
You can find a more detailed answer and instructions on how to fix this issue here.
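If you build an uber jar with sbt-assembly, one common workaround is to relocate (shade) Guava inside your jar so the version needed by google-cloud-storage cannot clash with the older Guava on Spark's classpath. A minimal sketch, assuming a recent sbt-assembly is enabled (the relocated package name is arbitrary; older plugin versions use the assemblyShadeRules in assembly syntax):
// In build.sbt: rename Guava's packages inside the assembled jar so they
// no longer collide with the Guava that ships with Spark.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
Alternatively, some setups force a newer Guava with dependencyOverrides, but shading is usually the more robust option when Spark itself provides the conflicting classes.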

Related

Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0

I have an sbt project and I want to write a test with ScalaTest and a shared Spark session. A few weeks ago my project started producing an error.
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
.....
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.0 requires Jackson Databind version >= 2.10.0 and < 2.11.0
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
at com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
Here is a very simple test:
import org.apache.spark.sql.QueryTest.checkAnswer
import org.apache.spark.sql.Row
import org.apache.spark.sql.test.SharedSparkSession
class SparkTestSpec extends SharedSparkSession {
  import testImplicits._

  test("join - join using") {
    val df = Seq(1, 2, 3).toDF("int")
    checkAnswer(df, Row(1) :: Row(2) :: Row(3) :: Nil)
  }
}
And the sbt config:
ThisBuild / scalaVersion := "2.12.10"
val sparkVersion = "3.1.0"
val scalaTestVersion = "3.2.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion % Test,
"org.apache.spark" %% "spark-sql" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test,
"org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-hive" % sparkVersion % Test,
"org.apache.spark" %% "spark-hive" % sparkVersion % Test classifier "tests",
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion % Test classifier "tests",
"log4j" % "log4j" % "1.2.17",
"org.slf4j" % "slf4j-log4j12" % "1.7.30",
"org.scalatest" %% "scalatest" % scalaTestVersion % Test,
"org.scalatestplus" %% "scalacheck-1-14" % "3.2.2.0",
)
This is a classic issue with Jackson. The error tells you that you need a single, compatible Jackson version across all your dependencies, but that is not the case here.
Usually you have both Spark and another library pulling in Jackson transitively, in different versions.
What you need to do is:
run sbt dependencyTree to identify which libraries pull in Jackson and in which versions
define dependencyOverrides to force the same Jackson version for all Jackson modules (which version is up to you, depending on compatibility with the other libraries that need it), as in the sketch below
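A minimal sketch of such an override, assuming you settle on Jackson 2.10.0 (the version that the jackson-module-scala 2.10.0 in the error message accepts; adjust it to whatever dependencyTree shows):
// In build.sbt: pin every Jackson artifact to one version so that
// jackson-module-scala and jackson-databind stay compatible.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.10.0",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.10.0",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.10.0",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.10.0"
)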

Extracting structure failed, reason: not ok build status neo4j connector

I am facing an issue when trying to import the Neo4j Spark connector into my IntelliJ sbt project.
My build.sbt file looks like this:
name := "shortestneo4j"
version := "0.1"
scalaVersion := "2.11.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
// https://mvnrepository.com/artifact/org.apache.spark/spark-graphx
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "2.4.0"
// https://mvnrepository.com/artifact/neo4j-contrib/neo4j-spark-connector
libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.4.5-M1"
I get an error from the last line. I am using Java 1.8 and Spark 2.4.0.

Apache Spark 3.1.2 can't read from S3 via documented spark-hadoop-cloud

The Spark documentation suggests using spark-hadoop-cloud to read / write from S3 in https://spark.apache.org/docs/latest/cloud-integration.html .
However, there is no Apache Spark published artifact for spark-hadoop-cloud. When trying to use the Cloudera-published module instead, the following exception occurs:
Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
This seems like a classpath conflict, which suggests it's not possible to use spark-hadoop-cloud to read from S3 with the vanilla Apache Spark 3.1.2 jars. Here is the build.sbt:
version := "0.0.1"
scalaVersion := "2.12.12"
lazy val app = (project in file("app")).settings(
assemblyPackageScala / assembleArtifact := false,
assembly / assemblyJarName := "uber.jar",
assembly / mainClass := Some("com.example.Main"),
// more settings here ...
)
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.1.7.2.7.0-184"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
import org.apache.spark.sql.SparkSession

object SparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local")
      //.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
      //.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
      .appName("spark session").getOrCreate
    val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
    val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
    jsonDF.show()
    csvDF.show()
  }
}
To read and write to S3 from Spark you only need these 2 dependencies:
"org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion
Make sure hadoopVersion is the same version used by your worker nodes, and make sure your worker nodes also have these dependencies available. The rest of your code looks correct.
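For example, a minimal sketch of the relevant build.sbt lines, assuming the cluster runs Hadoop 3.2.0 (the version string here is only illustrative; match it to your workers):
// Pin hadoop-aws and hadoop-common to the cluster's Hadoop version;
// hadoop-aws pulls in the matching AWS SDK bundle transitively.
val hadoopVersion = "3.2.0"
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-common" % hadoopVersion
)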

How to avoid loading a dependency in spark-streaming and kafka?

I'm trying to get an example of Kafka and Spark Streaming working, and I run into problems when running the process.
This is the exception:
[error] Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.8
This is the build.sbt:
name := "SparkJobs"
version := "1.0"
scalaVersion := "2.11.6"
val sparkVersion = "2.4.1"
val flinkVersion = "1.7.2"
resolvers ++= Seq(
"Typesafe Releases" at "http://repo.typesafe.com/typesafe/releases/",
"apache snapshots" at "http://repository.apache.org/snapshots/",
"confluent.io" at "http://packages.confluent.io/maven/",
"Maven central" at "http://repo1.maven.org/maven2/"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion
// ,"org.apache.flink" %% "flink-connector-kafka-0.10" % flinkVersion
, "org.apache.kafka" %% "kafka-streams-scala" % "2.2.0"
// , "io.confluent" % "kafka-streams-avro-serde" % "5.2.1"
)
//excludeDependencies ++= Seq(
// commons-logging is replaced by jcl-over-slf4j
// ExclusionRule("jackson-module-scala", "jackson-module-scala")
//
//)
This is the code
Running sbt dependencyTree, I can see that spark-core_2.11-2.4.1.jar brings in jackson-databind 2.6.7.1 and that it is evicted by version 2.9.8, which suggests a collision between libraries. But spark-core_2.11-2.4.1.jar is not the only one involved: kafka-streams-scala_2.11:2.2.0 uses jackson-databind 2.9.8, so I don't know which library has to evict jackson-databind 2.9.8. spark-core, kafka-streams-scala, or both?
How can I avoid Jackson 2.9.8 in order to get this task up and running?
I am assuming that I need the jackson-databind 2.6.7 version ...
UPDATE after following the advice. Still not working.
I have deleted the kafka-streams-scala dependency, which tries to use Jackson 2.9.8, using this build.sbt:
name := "SparkJobs"
version := "1.0"
scalaVersion := "2.11.6"
val sparkVersion = "2.4.1"
val flinkVersion = "1.7.2"
val kafkaStreamScala = "2.2.0"
resolvers ++= Seq(
"Typesafe Releases" at "http://repo.typesafe.com/typesafe/releases/",
"apache snapshots" at "http://repository.apache.org/snapshots/",
"confluent.io" at "http://packages.confluent.io/maven/",
"Maven central" at "http://repo1.maven.org/maven2/"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion ,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion
)
But I got a new exception.
UPDATE 2
Got it, now I understand the second exception: I forgot to call awaitTermination.
Kafka Streams includes Jackson 2.9.8, but you don't need it when using Spark Streaming's Kafka integration, so you should really just remove it.
Similarly, kafka-streams-avro-serde isn't what you want to be using with Spark; you might find AbraOSS/ABRiS useful instead.
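For reference, a minimal sketch of Spark Streaming's Kafka integration using only spark-streaming-kafka-0-10; the broker address, topic and group id below are placeholders, not values from the question:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker, topic and group id -- replace with your own values.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("example-topic"), kafkaParams)
    )

    // Print the message values of each micro-batch.
    stream.map(record => record.value).print()

    ssc.start()
    ssc.awaitTermination() // without this the driver exits before any batch is processed
  }
}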

Spark dependencies configuration for streaming from Twitter

I am trying to run a Spark application with Twitter streaming. However, I am constantly experiencing problems with dependencies.
When I use the org.apache.bahir spark-streaming-twitter dependency I get this error:
module not found: org.apache.bahir#spark-streaming-twitter;2.0.0
Here is the corresponding build.sbt file:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6"
)
But when I use the older streaming dependency I get a ClassNotFoundException: org.apache.spark.Logging error.
Here is the corresponding build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6",
"org.apache.spark" %% "spark-streaming-twitter" % "1.6.3"
)
In order to run my application, I run the sbt clean and package commands.
So which dependencies should I use, and how should I configure them to run my application?
The Twitter backend was removed from Spark with the 2.0 release, and the Bahir version you declared doesn't match your Spark version. Finally, Bahir's spark-streaming-twitter already comes with the twitter4j-stream dependency (4.0.4 at the moment). Use:
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion
)
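With those dependencies in place, here is a minimal sketch of how the Bahir connector is typically used; the twitter4j OAuth system properties are real settings, but the environment variable names are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamSketch {
  def main(args: Array[String]): Unit = {
    // twitter4j reads OAuth credentials from system properties;
    // the environment variable names here are placeholders.
    System.setProperty("twitter4j.oauth.consumerKey", sys.env("TWITTER_CONSUMER_KEY"))
    System.setProperty("twitter4j.oauth.consumerSecret", sys.env("TWITTER_CONSUMER_SECRET"))
    System.setProperty("twitter4j.oauth.accessToken", sys.env("TWITTER_ACCESS_TOKEN"))
    System.setProperty("twitter4j.oauth.accessTokenSecret", sys.env("TWITTER_ACCESS_TOKEN_SECRET"))

    val conf = new SparkConf().setMaster("local[2]").setAppName("TwitterStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // None = build the twitter4j Authorization from the system properties above.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}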