Unable to write files after scala and spark upgrade - scala

My project was previously using Scala version 2.11.12, which I have upgraded to 2.12.10, and the Spark version has been upgraded from 2.4.0 to 3.1.2. The build.sbt file below shows the rest of the project dependencies and versions:
scalaVersion := "2.12.10"
val sparkVersion = "3.1.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.4"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.8"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.8" % "test, it"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "3.1.2_1.1.0" % "test, it"
libraryDependencies += "com.github.pureconfig" %% "pureconfig" % "0.12.1"
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
libraryDependencies += "org.pegdown" % "pegdown" % "1.1.0" % "test, it"
libraryDependencies += "com.github.scopt" %% "scopt" % "3.7.1"
libraryDependencies += "com.github.pathikrit" %% "better-files" % "3.8.0"
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging" % "3.9.2"
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1" excludeAll (
ExclusionRule(organization = "org.apache.spark")
)
libraryDependencies += "net.liftweb" %% "lift-json" % "3.4.0"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.1"
The app builds fine after the upgrade, but it is no longer able to write files to the filesystem, which worked fine before the upgrade. I haven't made any code changes to the write logic.
The relevant portion of code that writes to the files is shown below.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val inputStream = getClass.getResourceAsStream(resourcePath)
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
val output = fs.create(new Path(outputPath))
IOUtils.copyBytes(inputStream, output.getWrappedStream, conf, true)
I am wondering whether IOUtils is incompatible with the new Scala/Spark versions?
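For reference, the same copy can also be written with explicit resource handling. This is only a sketch, assuming the Hadoop client libraries bundled with Spark are on the classpath; copyResource is a hypothetical helper name, not something from the project above:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: copy a classpath resource to a Hadoop-supported filesystem.
def copyResource(resourcePath: String, outputPath: String, conf: Configuration): Unit = {
  val in = getClass.getResourceAsStream(resourcePath)
  require(in != null, s"resource not found: $resourcePath")
  val out = FileSystem.get(conf).create(new Path(outputPath))
  try {
    // closeStreams = false: both streams are closed explicitly in the finally block
    IOUtils.copyBytes(in, out, conf, false)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
}
```

Closing in a finally block makes any IOException from the copy itself visible, instead of being masked by stream cleanup.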

Related

scala integration tests "No such setting/task"

Integration test configuration is new to me.
I cannot get my ScalaTest integration tests to run (in sbt or IntelliJ).
My unit tests in src/test/scala run fine.
My integration tests are in src/it/scala.
If I run sbt it:test, the error is "No such setting/task".
If I run in IntelliJ (i.e., with the 'run' button), I get
Unable to load a Suite class. This could be due to an error in your runpath. Missing class: xxx.tools.es_ingester.EsIntegrationSpec
java.lang.ClassNotFoundException: xxx.tools.es_ingester.ConfluenceEsIntegrationSpec
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
The class, however, is clearly in /src/it/scala/xxx/tools/es_ingester.
update: build.sbt
name := "xxx.tools.data_extractor"
version := "0.1"
organization := "xxx.tools"
scalaVersion := "2.11.12"
sbtVersion := "1.2.7"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.5"
libraryDependencies += "ch.qos.logback" % "logback-core" % "1.2.3"
libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.25"
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.2.3"
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.11.2"
libraryDependencies += "commons-io" % "commons-io" % "2.6"
libraryDependencies += "org.bouncycastle" % "bcprov-jdk15on" % "1.61"
libraryDependencies += "org.mockito" % "mockito-core" % "2.24.0" % Test
libraryDependencies += "com.typesafe" % "config" % "1.3.3"
libraryDependencies += "com.typesafe.play" %% "play" % "2.7.0"
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-client" % "6.6.0"
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "6.6.0"
libraryDependencies += "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "6.6.0"
libraryDependencies += "org.jsoup" % "jsoup" % "1.11.3"
You have not added the configuration for integration tests, for example scoping ScalaTest to the it configuration along with the default integration-test settings:
"org.scalatest" %% "scalatest" % "3.0.5" % "it, test"
For the full set of integration-test settings, refer to the sbt documentation on integration testing.
I hope it will help.
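Concretely, the answer above amounts to something like the following build.sbt fragment (a sketch: the predefined IntegrationTest configuration and Defaults.itSettings are standard sbt, while the Scala and ScalaTest versions are taken from the question):

```scala
// Wire up the predefined IntegrationTest configuration so that
// sources under src/it/scala are compiled and `sbt it:test` works.
lazy val root = (project in file("."))
  .configs(IntegrationTest)
  .settings(
    Defaults.itSettings,
    scalaVersion := "2.11.12",
    // scope ScalaTest to both the test and it configurations
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "it,test"
  )
```

After this change, re-import the sbt project in IntelliJ so the it sources are recognized and the run button can find the Suite classes.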

How to set up a spark build.sbt file?

I have been trying all day and cannot figure out how to make it work.
I have a common library that will be my core lib for Spark.
My build.sbt file is not working:
name := "CommonLib"
version := "0.1"
scalaVersion := "2.12.5"
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
// resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
// resolvers += Resolver.sonatypeRepo("public")
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "org.apache.hadoop" % "hadoop-common" % "2.7.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  // "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-hive_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-yarn_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "com.github.scopt" %% "scopt" % "3.7.0"
)
//addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
//libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
//libraryDependencies ++= {
// val sparkVer = "2.1.0"
// Seq(
// "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources()
// )
//}
The commented-out lines are all the attempts I've made, and I don't know what to do anymore.
My goal is to get Spark 2.3 working and to have scopt available too.
For my sbt version, I have 1.1.1 installed.
Thank you.
I think I had two main issues.
Spark is not compatible with Scala 2.12 yet, so moving to 2.11.12 solved one issue.
The second issue is that for the IntelliJ sbt console to reload build.sbt, you either need to kill and restart the console or use the reload command, which I didn't know, so I was not actually using the latest build.sbt file.
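A minimal build.sbt reflecting the first fix might look like the following sketch (Spark 2.3 on Scala 2.11; the `provided` scope is an assumption, appropriate when the jar is submitted to a cluster that already ships Spark):

```scala
name := "CommonLib"
version := "0.1"

// Spark 2.3.x is published for Scala 2.11, so stay on 2.11 for now
scalaVersion := "2.11.12"

val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.11) automatically,
  // avoiding the _2.10 / 2.12 mismatch from the original file
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
  "com.github.scopt" %% "scopt" % "3.7.0"
)
```

Remember to run reload in the sbt console (or restart it) so the new build.sbt actually takes effect.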
There's a Giter8 template that should work nicely:
https://github.com/holdenk/sparkProjectTemplate.g8

Why does Spark with Play fail with "NoClassDefFoundError: Could not initialize class org.apache.spark.SparkConf$"?

I am trying to use this project (https://github.com/alexmasselot/spark-play-activator) as an example of integrating Play and Spark, so I can do the same in my project. I created an object that starts Spark and a controller that reads a JSON file using an RDD. Below is the object that starts Spark:
package bootstrap

import org.apache.spark.sql.SparkSession

object SparkCommons {
  val sparkSession = SparkSession
    .builder
    .master("local")
    .appName("ApplicationController")
    .getOrCreate()
}
and my build.sbt is like this:
import play.sbt.PlayImport._
name := """crypto-miners-demo"""
version := "1.0-SNAPSHOT"
lazy val root = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.12.4"
libraryDependencies += guice
libraryDependencies += evolutions
libraryDependencies += jdbc
libraryDependencies += filters
libraryDependencies += ws
libraryDependencies += "com.h2database" % "h2" % "1.4.194"
libraryDependencies += "com.typesafe.play" %% "anorm" % "2.5.3"
libraryDependencies += "org.scalatestplus.play" %% "scalatestplus-play" % "3.1.0" % Test
libraryDependencies += "com.typesafe.play" %% "play-slick" % "3.0.0"
libraryDependencies += "com.typesafe.play" %% "play-slick-evolutions" % "3.0.0"
libraryDependencies += "org.xerial" % "sqlite-jdbc" % "3.19.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
But when I call a controller that uses the RDD, I get this error from the Play framework:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.SparkConf$
I am using the RDD like this: val rdd = SparkCommons.sparkSession.read.json("downloads/tweet-json").
The application whose configuration I am trying to copy works well. The only thing I could carry over into my build.sbt was the jackson-databind lib; I get an error when I copy libraryDependencies ++= Dependencies.sparkAkkaHadoop and ivyScala := ivyScala.value map { _.copy(overrideScalaVersion = true) } into my build.sbt.
I will write it 100000 times on the blackboard and never forget: Spark 2.2.0 still doesn't work with Scala 2.12. I also edited the Jackson lib version. Below is my build.sbt.
import play.sbt.PlayImport._
name := """crypto-miners-demo"""
version := "1.0-SNAPSHOT"
lazy val root = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.11.8"
libraryDependencies += guice
libraryDependencies += evolutions
libraryDependencies += jdbc
libraryDependencies += filters
libraryDependencies += ws
libraryDependencies += "com.h2database" % "h2" % "1.4.194"
libraryDependencies += "com.typesafe.play" %% "anorm" % "2.5.3"
libraryDependencies += "org.scalatestplus.play" %% "scalatestplus-play" % "3.1.0" % Test
libraryDependencies += "com.typesafe.play" %% "play-slick" % "3.0.0"
libraryDependencies += "com.typesafe.play" %% "play-slick-evolutions" % "3.0.0"
libraryDependencies += "org.xerial" % "sqlite-jdbc" % "3.19.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5"

Spark 2.1.0 incompatible Jackson versions 2.7.6

I am trying to run a simple Spark example in IntelliJ, but I get this error:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:819)
at spark.test$.main(test.scala:19)
at spark.test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.6
at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
I have tried to update my Jackson dependency, but it does not seem to work. I did this:
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
but the same error message still appears. Can someone help me fix this error?
Here is the Spark example code:
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext("local", "wordcount", conf)
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
And here is my build.sbt:
name := "testSpark2"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-repl_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-network-shuffle_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume-assembly_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mesos_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-launcher_2.11" % "2.1.0"
Spark 2.1.0 pulls in com.fasterxml.jackson.core as a transitive dependency, so we do not need to include it in libraryDependencies.
But if you want a different version of the com.fasterxml.jackson.core dependencies, you have to override them, like this:
name := "testSpark2"
version := "1.0"
scalaVersion := "2.11.8"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-repl_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-network-shuffle_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume-assembly_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-mesos_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-launcher_2.11" % "2.1.0"
So, change your build.sbt like the one above and it will work as expected.
I hope it helps!
FYI: in my case, I'm using Spark and kafka-streams in the app, and kafka-streams pulls in com.fasterxml.jackson.core 2.8.5. Adding an exclude as below fixed the issue (Gradle):
compile(group: "org.apache.kafka", name: "kafka-streams", version: "0.11.0.0") {
    exclude group: "com.fasterxml.jackson.core"
}
Solution with Gradle, using resolutionStrategy (https://docs.gradle.org/current/dsl/org.gradle.api.artifacts.ResolutionStrategy.html):
configurations {
    all {
        resolutionStrategy {
            force 'com.fasterxml.jackson.core:jackson-core:2.4.4',
                  'com.fasterxml.jackson.core:jackson-databind:2.4.4',
                  'com.fasterxml.jackson.core:jackson-annotations:2.4.4'
        }
    }
}

Error when run jar Exception in thread "main" java.lang.NoSuchMethodError scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

I am working on a Spark application (Spark 2.0.0 and Scala 2.11.8), and it works fine within the IntelliJ IDEA environment. I packaged the application as a jar file and tried to run it from the jar, but this error was raised in the terminal:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at org.apache.spark.util.Utils$.getSystemProperties(Utils.scala:1632)
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:65)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:60)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at Main$.main(Main.scala:26)
at Main.main(Main.scala)
I've read discussions of similar questions, but they all talk about mismatched Scala versions; however, my sbt file is this:
name := "BaiscFM"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.0.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "2.0.0"
libraryDependencies += "com.typesafe.akka" % "akka-actor_2.11" % "2.4.17"
libraryDependencies += "net.liftweb" % "lift-json_2.11" % "2.6"
libraryDependencies += "com.typesafe.play" % "play-json_2.11" % "2.4.0-M2"
libraryDependencies += "org.json" % "json" % "20090211"
libraryDependencies += "org.scalaj" % "scalaj-http_2.11" % "2.3.0"
libraryDependencies += "org.drools" % "drools-core" % "6.3.0.Final"
libraryDependencies += "org.drools" % "drools-compiler" % "6.3.0.Final"
How can I fix this problem?