JodaTime issues with scala and spark when invoking spark-submit

I am having trouble using JodaTime in a Spark Scala program. I tried the solutions posted on Stack Overflow in the past, but they don't seem to fix the issue for me.
When I run spark-submit it fails with an error like the following:
15/09/04 17:51:57 INFO Remoting: Remoting started; listening on addresses :
[akka.tcp://sparkDriver@100.100.100.81:56672]
Exception in thread "main" java.lang.NoClassDefFoundError: org/joda/time/DateTimeZone
at com.ttams.xrkqz.GenerateCsv$.main(GenerateCsv.scala:50)
at com.ttams.xrkqz.GenerateCsv.main(GenerateCsv.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
After sbt package, which seems to work fine, I invoke spark-submit like this ...
~/spark/bin/spark-submit --class "com.ttams.xrkqz.GenerateCsv" --master local target/scala-2.10/scala-xrkqz_2.10-1.0.jar
In my build.sbt file, I have
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
libraryDependencies ++= Seq ("joda-time" % "joda-time" % "2.8.2",
"org.joda" % "joda-convert" % "1.7"
)
I have tried multiple versions of joda-time and joda-convert, but I am not able to run it via spark-submit from the command line. However, it works when I run it inside the IDE (Scala IDE).
Let me know if you have any suggestions or ideas.

It seems that you are missing the dependencies on your classpath. You could do a few things. One option is to manually add the Joda-Time jars to spark-submit with the --jars argument; the other is to use the assembly plugin and build an assembly (fat) jar that contains all of your dependencies (in that case you will likely want to mark spark-core as "provided" so it doesn't end up in your assembly).
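For the first option, a rough sketch based on the spark-submit invocation above (the jar paths are placeholders; point them at wherever the Joda-Time 2.8.2 and joda-convert 1.7 jars live on your machine, e.g. the Ivy cache):
~/spark/bin/spark-submit --class "com.ttams.xrkqz.GenerateCsv" --master local --jars /path/to/joda-time-2.8.2.jar,/path/to/joda-convert-1.7.jar target/scala-2.10/scala-xrkqz_2.10-1.0.jar
For the assembly option, marking Spark itself as provided in build.sbt looks like this, so that only your real dependencies (Joda-Time and joda-convert) end up in the fat jar:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"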

Related

What is the idiomatic way to deploy a scala binary with dependencies without using sbt-assembly?

After creating a useful application in Scala with dependencies, how do I deploy (create a binary) for it?
I would like to know the most idiomatic way, which is hopefully also the simplest.
For me that would be the usual sbt compile, then locating the main class:
./target/scala-2.12/classes/scala_pandoc/Main.class
Then execute it:
$ CLASSPATH="$CLASSPATH:./target/scala-2.12/classes/" scala scala_pandoc.Main --unwrap-explain
Picked up _JAVA_OPTIONS: -Xms256m -Xmx300m
java.lang.ClassNotFoundException: ujson.Value
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at scala_pandoc.Main$.main(Main.scala:51)
at scala_pandoc.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader.$anonfun$run$2(ScalaClassLoader.scala:106)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:132)
at scala.reflect.internal.util.ScalaClassLoader.run(ScalaClassLoader.scala:106)
at scala.reflect.internal.util.ScalaClassLoader.run$(ScalaClassLoader.scala:98)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:132)
at scala.tools.nsc.CommonRunner.run(ObjectRunner.scala:28)
at scala.tools.nsc.CommonRunner.run$(ObjectRunner.scala:27)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:45)
at scala.tools.nsc.CommonRunner.runAndCatch(ObjectRunner.scala:35)
at scala.tools.nsc.CommonRunner.runAndCatch$(ObjectRunner.scala:34)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:45)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:73)
at scala.tools.nsc.MainGenericRunner.run$1(MainGenericRunner.scala:92)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:103)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:108)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
But as we can see, it does not find the dependencies. When I compile the project, a bunch of files are downloaded/created under ~/.sbt/ and ~/.ivy2, but adding those directories (or all of their subfolders) to CLASSPATH does not solve the issue either.
The aforementioned procedure works for projects without external dependencies.
Workaround:
Use https://github.com/sbt/sbt-assembly, which is great (it creates an executable .jar that I can run with java -jar myjar.jar), but it feels hackish/unofficial/fragile, and it also adds another dependency to my project.
build.sbt:
lazy val scalatest = "org.scalatest" %% "scalatest" % "3.0.5"
lazy val ujson = "com.lihaoyi" %% "ujson" % "0.7.1"
name := "scala_pandoc"
organization := "org.fmv1992"
licenses += "GPLv2" -> url("https://www.gnu.org/licenses/gpl-2.0.html")
lazy val commonSettings = Seq(
  version := "0.0.1-SNAPSHOT",
  scalaVersion := "2.12.8",
  pollInterval := scala.concurrent.duration.FiniteDuration(50L, "ms"),
  maxErrors := 10,
  // The "% Test" qualifier below makes the test artifacts importable only from test files.
  // libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % Test,
  // ↑↑↑↑↑
  // Removed on commit 'cd9d482' so that 'trait ScalaInitiativesTest' can define
  // 'namedTest'.
  libraryDependencies ++= Seq(scalatest, ujson),
  scalacOptions ++= Seq("-feature", "-deprecation", "-Xfatal-warnings")
)
lazy val root = (project in file("."))
  .settings(commonSettings)
  .settings(assemblyJarName in assembly := "scala_pandoc.jar")
project/build.properties:
sbt.version=1.2.8
project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
Related question: Deploy Scala binaries without dependencies
sbt-assembly is not hacky: it is maintained by one of the sbt creators (Eugene Yokota) and lives in the official sbt organization, so it is the official way of deploying Scala JARs with sbt.
Well, one of several official ways; you can also take a look at sbt-native-packager. The thing is, there are so many possible targets of an sbt build that the authors decided that even building an uberjar should not be a special snowflake, and that it should be done via a plugin.
So just use sbt-assembly and don't feel guilty about it. That is the idiomatic way.
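With the build.sbt shown above, usage then comes down to two commands (the output path below assumes the sbt-assembly default location under crossTarget plus the assemblyJarName configured above; adjust if your setup differs):
sbt assembly
java -jar target/scala-2.12/scala_pandoc.jar --unwrap-explain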

How to load RDDs from S3 files from spark-shell?

I have a text file in S3 that I would like to load into an RDD with spark-shell.
I have downloaded Spark 2.3.0 for Hadoop. Naively, I would expect that I just need to set the Hadoop configuration and I'd be good to go.
val inFile = "s3a://some/path"
val accessKey = "some-access-key"
val secretKey = "some-secret-key"
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.textFile(inFile).count()
println(run())
Invoking the final line returns:
Failure(java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)
This seems to be telling me to provide the library that contains S3AFileSystem. No problem: I download the appropriate jar and add this line to the beginning of the script.
:require C:\{path-to-jar}\hadoop-aws-3.1.0.jar
Now, running the script fails at the final line with a variety of errors similar to this:
error: error while loading Partition, class file 'C:\spark\spark-2.3.0-bin-hadoop2.7\jars\spark-core_2.11-2.3.0.jar(org/apache/spark/Partition.class)' has location not matching its contents: contains class Partition
I'm lost at this point - clearly, it had no issue defining the run method before.
I can access the Partition class myself directly, but something is happening above that prevents the code from accessing Partition.
scala> new org.apache.spark.Partition {def index = 3}
res6: org.apache.spark.Partition = $anon$1#3
Curiously, running the final line of the script yields a different error in subsequent invocations.
scala> sc.textFile(inFile).count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
at java.lang.ClassLoader.defineClass1(Native Method)
...
The documentation claims this is part of hadoop 3.1.0, which I'm using, but when exploring hadoop-aws-3.1.0.jar I see no trace of StreamCapabilities.
Is there a different jar I should be using? Am I trying to solve this problem incorrectly? Or, have I fallen into the XY problem trap?
Answers I tried
The official docs assume I'm running the script on a cluster. But I'm running spark-shell locally.
Another Stack Overflow question covers an older problem; it is why I'm using s3a, but I'm encountering a different issue.
I also tried every Hadoop jar version from 2.6 to 3.1, to no avail.
org.apache.hadoop.fs.StreamCapabilities is in hadoop-common-3.1.jar.
You are probably mixing versions of Hadoop JARs which, as covered in the s3a troubleshooting docs, is doomed.
Spark shell works fine with the right JARs in place. But ASF Spark releases don't work with Hadoop 3.x yet, due to some outstanding issues. Stick to Hadoop 2.8.x and you'll get good S3 performance without so much pain.
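As a minimal sketch of "the right JARs in place" for a local spark-shell, assuming the Hadoop 2.7 build that ships as spark-2.3.0-bin-hadoop2.7, one option is to let --packages pull a matching hadoop-aws (and its AWS SDK dependency) instead of adding jars by hand:
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
The fs.s3a.access.key / fs.s3a.secret.key settings from the question can then be set on sc.hadoopConfiguration exactly as before.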
I found a path that fixed the issue, but I have no idea why.
Create an SBT IntelliJ project
Include the below dependencies and overrides
Run the script (sans require statement) from sbt console
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.7"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.7"
The key part, naturally, is overriding the jackson dependencies.

sryza/spark-timeseries: NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;

I have a Scala project that I build with sbt. It uses the sryza/spark-timeseries library.
I am trying to run the following simple code:
val tsAirPassengers = new DenseVector(Array(
112.0,118.0,132.0,129.0,121.0,135.0,148.0,148.0,136.0,119.0,104.0,118.0,115.0,126.0,
141.0,135.0,125.0,149.0,170.0,170.0,158.0,133.0,114.0,140.0,145.0,150.0,178.0,163.0,
172.0,178.0,199.0,199.0,184.0,162.0,146.0,166.0,171.0,180.0,193.0,181.0,183.0,218.0,
230.0,242.0,209.0,191.0,172.0,194.0,196.0,196.0,236.0,235.0,229.0,243.0,264.0,272.0,
237.0,211.0,180.0,201.0,204.0,188.0,235.0,227.0,234.0,264.0,302.0,293.0,259.0,229.0,
203.0,229.0,242.0,233.0,267.0,269.0,270.0,315.0,364.0,347.0,312.0,274.0,237.0,278.0,
284.0,277.0,317.0,313.0,318.0,374.0,413.0,405.0,355.0,306.0,271.0,306.0,315.0,301.0,
356.0,348.0,355.0,422.0,465.0,467.0,404.0,347.0,305.0,336.0,340.0,318.0,362.0,348.0,
363.0,435.0,491.0,505.0,404.0,359.0,310.0,337.0,360.0,342.0,406.0,396.0,420.0,472.0,
548.0,559.0,463.0,407.0,362.0,405.0,417.0,391.0,419.0,461.0,472.0,535.0,622.0,606.0,
508.0,461.0,390.0,432.0
))
val period = 12
val model = HoltWinters.fitModel(tsAirPassengers, period, "additive", "BOBYQA")
It builds fine, but when I try to run it, I get this error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;
at com.cloudera.sparkts.models.HoltWintersModel.convolve(HoltWinters.scala:252)
at com.cloudera.sparkts.models.HoltWintersModel.initHoltWinters(HoltWinters.scala:277)
at com.cloudera.sparkts.models.HoltWintersModel.getHoltWintersComponents(HoltWinters.scala:190)
...
The error occurs on this line:
val model = HoltWinters.fitModel(tsAirPassengers, period, "additive", "BOBYQA")
My build.sbt includes:
name := "acme-project"
version := "0.0.1"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-hive" % "1.6.0",
  "net.liftweb" %% "lift-json" % "2.5+",
  "com.github.seratch" %% "awscala" % "0.3.+",
  "org.apache.spark" % "spark-mllib_2.10" % "1.6.2"
)
I have placed sparkts-0.4.0-SNAPSHOT.jar in the lib folder of my project. (I would have preferred to add a libraryDependency, but spark-ts does not appear to be on Maven Central.)
What is causing this run-time error?
The library requires Scala 2.11, not 2.10, and Spark 2.0, not 1.6.2, as you can see from
<scala.minor.version>2.11</scala.minor.version>
<scala.complete.version>${scala.minor.version}.8</scala.complete.version>
<spark.version>2.0.0</spark.version>
in pom.xml. You can try changing these versions and seeing if it still compiles, look for an older version of sparkts that is compatible with your Scala and Spark versions, or update your project's Scala and Spark versions (don't forget spark-mllib_2.10 in that case).
Also, if you put the jar into the lib folder, you have to put its dependencies there too (and their dependencies, and so on), or add them to libraryDependencies. Instead, publish sparkts into your local repository using mvn install (IIRC) and add it to libraryDependencies, which lets sbt resolve its dependencies.
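A sketch of what the updated build.sbt could look like after the mvn install, assuming sparkts is published locally under the com.cloudera.sparkts group (check the installed POM for the exact coordinates and version):
scalaVersion := "2.11.8"
resolvers += Resolver.mavenLocal
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-hive" % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0",
  "com.cloudera.sparkts" % "sparkts" % "0.4.0-SNAPSHOT"
)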

Using s3n:// with spark-submit

I've written a Spark app that is to run on a cluster using spark-submit. Here's part of my build.sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided" exclude("asm", "asm")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
asm is excluded because I'm using another library that depends on a different version of it. The asm dependency in Spark seems to come from one of Hadoop's dependencies, and I'm not using that functionality.
The problem now is that, with this setup, saveAsTextFile("s3n://my-bucket/dir/file") throws java.io.IOException: No FileSystem for scheme: s3n.
Why is this happening? Shouldn't spark-submit provide the Hadoop dependencies?
I've tried a few things:
Leaving out "provided"
Putting hadoop-aws on the classpath, via a jar and spark.executor.extraClassPath and spark.driver.extraClassPath. This requires doing the same for all of its transitive dependencies though, which can be painful.
Neither really works. Is there a better approach?
I'm using the pre-built spark-1.6.1-bin-hadoop2.6.
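For reference, the extraClassPath attempt described above looks roughly like this (the class name, jar names, and paths are placeholders; hadoop-aws's transitive AWS SDK jar has to be listed as well, which is exactly the painful part):
spark-submit --class com.example.MyApp \
  --conf spark.driver.extraClassPath=/path/to/hadoop-aws-2.6.0.jar:/path/to/aws-java-sdk-1.7.4.jar \
  --conf spark.executor.extraClassPath=/path/to/hadoop-aws-2.6.0.jar:/path/to/aws-java-sdk-1.7.4.jar \
  my-app-assembly.jar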

Exception on spark test

I am building tests for my Scala Spark app, but I get the exception below in IntelliJ while running the test. Other tests, which don't use a SparkContext, run fine. If I run the tests from the terminal with "sbt test-only", the tests with a SparkContext work. Do I need to configure IntelliJ specially for tests that use a SparkContext?
An exception or error caused a run to abort: org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V
java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;)V
at org.apache.spark.graphx.impl.RoutingTableMessageRDDFunctions.copartitionWithVertices(RoutingTablePartition.scala:36)
at org.apache.spark.graphx.VertexRDD$.org$apache$spark$graphx$VertexRDD$$createRoutingTables(VertexRDD.scala:457)
at org.apache.spark.graphx.VertexRDD$.fromEdges(VertexRDD.scala:440)
at org.apache.spark.graphx.impl.GraphImpl$.fromEdgeRDD(GraphImpl.scala:336)
at org.apache.spark.graphx.impl.GraphImpl$.fromEdgePartitions(GraphImpl.scala:282)
at org.apache.spark.graphx.GraphLoader$.edgeListFile(GraphLoader.scala:91)
The most likely problem is that your spark-core version doesn't match the Spark version you run against.
Check your sbt file to make sure you are using the spark-core version that corresponds to your Spark installation, e.g.:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
libraryDependencies += "org.apache.spark" %% "spark-graphx" %"1.1.0"