How should I log from my custom Spark JAR - scala

Scala/JVM noob here who wants to understand more about logging, specifically when using Apache Spark.
I have written a library in Scala that depends upon a bunch of Spark libraries; here are my dependencies:
import sbt._

object Dependencies {
  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark,
    "org.apache.spark" %% "spark-sql" % Version.spark,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )

  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )
}
(admittedly I copied a lot of this from elsewhere).
I am able to package my code as a JAR using sbt package, which I can then call from Spark by placing the JAR into ${SPARK_HOME}/jars. This is working great.
I now want to implement logging from my code so I do this:
import com.typesafe.scalalogging.Logger
/*
* stuff stuff stuff
*/
val logger : Logger = Logger("name")
logger.info("stuff")
However, when I try to call my library (which I'm doing from Python, not that I think that's relevant here), I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.company.package.class.function.
E : java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger$
Clearly this is because the com.typesafe.scala-logging library is not in my JAR. I know I could solve this by packaging with sbt assembly, but I don't want to do that because it would include all the other dependencies and make my JAR enormous.
Is there a way to selectively include libraries (com.typesafe.scala-logging in this case) in my JAR? Alternatively, should I be attempting to log using another method, perhaps using a logger that is included with Spark?
Thanks to pasha701's suggestion in the comments, I attempted to package my dependencies using sbt assembly rather than sbt package.
import sbt._

object Dependencies {
  object Version {
    val spark = "2.2.0"
    val scalaTest = "3.0.0"
  }

  val deps = Seq(
    "org.apache.spark" %% "spark-core" % Version.spark % Provided,
    "org.scalatest" %% "scalatest" % Version.scalaTest,
    "org.apache.spark" %% "spark-hive" % Version.spark % Provided,
    "org.apache.spark" %% "spark-sql" % Version.spark % Provided,
    "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
    "ch.qos.logback" % "logback-core" % "1.2.3",
    "ch.qos.logback" % "logback-classic" % "1.2.3",
    "com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
    "com.typesafe" % "config" % "1.3.2"
  )

  val exc = Seq(
    ExclusionRule("org.slf4j", "slf4j-log4j12")
  )
}
Unfortunately, even with the Spark dependencies specified as Provided, my JAR went from 324K to 12M, hence I opted to use println() instead. Here is my commit message:
log using println
I went with the println option because it keeps the size of the JAR small.
I trialled use of com.typesafe.scalalogging.Logger but my tests failed with error:
java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger
because that isn't provided with Spark. I attempted to use sbt assembly
instead of sbt package but this caused the size of the JAR to go from
324K to 12M, even with spark dependencies set to Provided. A 12M JAR
isn't worth the trade-off just to use scalaLogging, hence using println
instead.
I note that pasha701 suggested using log4j instead, as that is provided with Spark, so I shall try that next. Any advice on using log4j from Scala when writing a Spark library would be much appreciated.
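In the meantime, here is an untested sketch of what I understand that suggestion to mean, using the log4j 1.2 API that Spark 2.2 already ships (the object, method, and logger names are placeholders):

import org.apache.log4j.LogManager

object MyLibrary {
  // @transient lazy val avoids serialization problems if the logger ends up
  // referenced from a class that Spark closes over and ships to executors.
  @transient private lazy val logger = LogManager.getLogger("com.company.package")

  def doStuff(): Unit = {
    logger.info("stuff")
  }
}

Because log4j is already on Spark's classpath, nothing extra would need to be packaged into the JAR.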

As you said, sbt assembly will include all the dependencies in your jar.
If you only want certain ones, you have two options:
Download logback-core and logback-classic and add them via the --jars option of the spark2-submit command.
Or specify the above dependencies via the --packages option of spark2-submit.
Both are sketched below.
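For illustration only (the paths, the application file name, and the Scala version suffix are placeholders; note that scala-logging itself also needs to be on the classpath, since that is the class the NoClassDefFoundError complains about):

spark2-submit \
  --jars /path/to/logback-core-1.2.3.jar,/path/to/logback-classic-1.2.3.jar,/path/to/scala-logging_2.11-3.8.0.jar \
  your_app.py

spark2-submit \
  --packages ch.qos.logback:logback-classic:1.2.3,com.typesafe.scala-logging:scala-logging_2.11:3.8.0 \
  your_app.py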

Related

Scala - Error java.lang.NoClassDefFoundError: upickle/core/Types$Writer

I'm new to Scala/Spark, so please be easy on me :)
I'm trying to run an EMR cluster on AWS, running the jar file I packaged with sbt package.
When I run the code locally, it is working perfectly fine, but when I'm running it in the AWS EMR cluster, I'm getting an error:
ERROR Client: Application diagnostics message: User class threw exception: java.lang.NoClassDefFoundError: upickle/core/Types$Writer
From what I understand, this error originates from a version mismatch between the Scala/Spark dependencies.
So I'm using Scala 2.12 with spark 3.0.1, and in AWS I'm using emr-6.2.0.
Here's my build.sbt:
scalaVersion := "2.12.14"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.792"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.792"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1"
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.4.1"
What am I missing?
Thanks!
If you use sbt package, the generated jar will contain only the code of your project, not its dependencies. You need to use sbt assembly to generate a so-called uberjar that includes the dependencies as well.
But in your case, it's recommended to mark the Spark and Hadoop (and maybe AWS) dependencies as Provided, since they should already be included in the EMR runtime. Use something like this:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided
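For example, an illustrative layout (the sbt-assembly plugin line goes in project/plugins.sbt; the plugin version here is only an example):

// project/plugins.sbt (version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt: runtime-provided dependencies stay out of the uberjar
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0" % Provided

// application libraries that must be bundled
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1"
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.4.1"

Running sbt assembly should then produce a jar that bundles upickle but not Spark or Hadoop.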

Scalatra application with spark dependency returns java.lang.NoSuchFieldError: INSTANCE due to multiple httpclient versions

I'm trying to build a Scalatra application that runs code with Spark. I can actually build the fat jar with sbt-assembly and the endpoints work, but when running tests with org.scalatra.test.scalatest._ I get the following error:
*** RUN ABORTED ***
java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:146)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:964)
at org.scalatra.test.HttpComponentsClient$class.createClient(HttpComponentsClient.scala:100)
at my.package.MyServletTests.createClient(MyServletTests.scala:5)
at org.scalatra.test.HttpComponentsClient$class.submit(HttpComponentsClient.scala:63)
at my.package.MyServletTests.submit(MyServletTests.scala:5)
at org.scalatra.test.Client$class.post(Client.scala:62)
at my.package.MyServletTests.post(MyServletTests.scala:5)
at org.scalatra.test.Client$class.post(Client.scala:60)
at my.package.MyServletTests.post(MyServletTests.scala:5)
...
From other sources, this appears to be an httpclient version conflict, since Scalatra and Spark use different versions. Those sources suggested using the Maven Shade Plugin to rename one of the versions. I am, however, using sbt instead of Maven. Even though sbt has shade functionality, it only applies when creating the fat jar, and I need a solution that also works during development tests.
Sources are:
Conflict between httpclient version and Apache Spark
Error while trying to use an API. java.lang.NoSuchFieldError: INSTANCE
Is there any way to resolve this kind of conflict using SBT? I'm running Eclipse Scala-IDE, and these are my dependencies in build.sbt:
val scalatraVersion = "2.6.5"
// scalatra
libraryDependencies ++= Seq(
  "org.scalatra" %% "scalatra" % scalatraVersion,
  "org.scalatra" %% "scalatra-scalatest" % scalatraVersion % "test",
  "org.scalatra" %% "scalatra-specs2" % scalatraVersion,
  "org.scalatra" %% "scalatra-swagger" % scalatraVersion,
  "ch.qos.logback" % "logback-classic" % "1.2.3" % "runtime",
  "org.eclipse.jetty" % "jetty-webapp" % "9.4.9.v20180320" % "container;compile",
  "javax.servlet" % "javax.servlet-api" % "3.1.0" % "provided"
)
// From other projects:
// spark
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-mllib" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0"
)
// scalatest
libraryDependencies += "org.scalatest" % "scalatest_2.11" % "3.0.5" % "test"

override guava dependency version of spark

Spark depends on an old version of guava.
I build my Spark project with sbt assembly, excluding Spark using provided and including the latest version of guava.
However, when running sbt-assembly, the guava dependency is also excluded from the jar.
My build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "com.google.guava" % "guava" % "11.0"
)
If I remove the % "provided", then both Spark and guava are included.
So, how can I exclude Spark and include guava?
You are looking for shading options; basically you need to add shading instructions. Something like this:
assemblyShadeRules in assembly := Seq(
  // Guava's classes live in the com.google.common package
  // (the Maven artifact is com.google.guava:guava).
  ShadeRule.rename("com.google.common.**" -> "my_conf.@1")
    .inLibrary("com.google.guava" % "guava" % "11.0")
    .inProject
)
There is also the corresponding maven-shade-plugin for those who prefer maven.

Scala sbt order of imports

Is it possible to change the order of imports in sbt for scala compilation?
I have two unmanaged dependencies in the ./lib/ folder and the rest are managed dependencies in the sbt file:
libraryDependencies ++= Seq(
  "org.slf4j" % "slf4j-api" % "1.7.10",
  "org.slf4j" % "slf4j-simple" % "1.7.10",
  "org.slf4j" % "slf4j-log4j12" % "1.7.10",
  "com.typesafe" % "config" % "1.0.2",
  "edu.washington.cs.knowitall.taggers" %% "taggers-core" % "0.4",
  "com.rockymadden.stringmetric" % "stringmetric-core" % "0.25.3",
  "org.apache.solr" % "solr-solrj" % "4.3.1",
  "com.twitter" %% "util-collection" % "6.3.6",
  "org.scalaj" %% "scalaj-http" % "0.3.10",
  "commons-logging" % "commons-logging" % "1.2"
)
In Eclipse I can run my program, because I can change the order of imports in the java build path (I put unmanaged dependencies at the end).
However when I want to run it from the terminal:
sbt "run-main de.questionParser.Test"
I get the following error
[error] Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Lists.reverse(Ljava/util/List;)Ljava/util/List;
So the final question is:
Is it possible to change the order of imports in sbt for scala compilation so that managed dependencies are included before unmanaged dependencies?
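Update: one unverified idea, assuming run-main resolves its classpath from fullClasspath in Runtime, is to transform that task so the unmanaged jars from ./lib are pushed to the end:

// Unverified sketch: move jars from ./lib to the end of the runtime classpath
// so managed dependencies are found first (the path check is deliberately naive).
fullClasspath in Runtime ~= { cp =>
  val (unmanaged, managed) = cp.partition(_.data.getPath.contains("/lib/"))
  managed ++ unmanaged
}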

Spark job SBT Assembly merge conflict when including Spark Streaming Kinesis ASL library

I started making a Spark Streaming job and got a producer up for a Kinesis endpoint. After getting that working, I started making a consumer, but I ran into problems building it.
I am using the assembly plugin to create a single jar that contains all the dependencies. The project's dependencies are as follows.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.4.1",
  "org.scalatest" %% "scalatest" % "2.2.1" % "test",
  "c3p0" % "c3p0" % "0.9.1.+",
  "com.amazonaws" % "aws-java-sdk" % "1.10.4.1",
  "mysql" % "mysql-connector-java" % "5.1.33",
  "com.amazonaws" % "amazon-kinesis-client" % "1.5.0"
)
When I run assembly, the files compile, but it fails during the merge phase with the error:
[error] (streamingClicks/*:assembly) deduplicate: different file contents found in the following:
[error] /Users/adam/.ivy2/cache/org.apache.spark/spark-network-common_2.10/jars/spark-network-common_2.10-1.4.1.jar:META-INF/maven/com.google.guava/guava/pom.properties
[error] /Users/adam/.ivy2/cache/com.google.guava/guava/bundles/guava-18.0.jar:META-INF/maven/com.google.guava/guava/pom.properties
This is caused by adding the spark-streaming-kinesis-asl dependency. How do I get around this? I could mark the dependency as provided and then add the jar to the classpath myself, but that's really not something I want to do.
I don't know the exact solution in your case, but you have to play with the merge strategy for these dependencies. It could be something like this:
// "meta" must be defined as a regex for the pattern below to compile;
// it matches META-INF entries so they get discarded.
lazy val meta = """META.INF(.)*""".r

lazy val strategy = assemblyMergeStrategy in assembly := {
  case "application.conf" => MergeStrategy.concat
  case meta(_)            => MergeStrategy.discard
  case x                  => MergeStrategy.first
}
lazy val core = (project in file("core")).
  settings(
    name := "core",
    libraryDependencies ++= Seq(
      ...
    ),
    strategy
  )
P.S. I also recommend assembling all the dependencies into one "fat jar" project and keeping your code in a dependent project. Then you can put the fat jar on HDFS and package your actual code with sbt package. To run your code you provide the fat jar and your package jar with the --jars option.
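For example (all names and paths here are hypothetical):

spark-submit \
  --jars hdfs:///libs/deps-fatjar.jar \
  --class com.example.Main \
  my-app_2.10-0.1.jar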