Saving data from Spark to Cassandra results in java.lang.ClassCastException - scala

I'm trying to save data from Spark to Cassandra in Scala, using saveToCassandra on an RDD or save with a DataFrame (both result in the same error). The full message is:
java.lang.ClassCastException:
com.datastax.driver.core.DefaultResultSetFuture cannot be cast to
com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
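For reference, the DataFrame write path looks roughly like this (keyspace test and table kv are placeholder names; the RDD path uses saveToCassandra instead):

import org.apache.spark.sql.DataFrame

// Sketch of the DataFrame save path (Spark 1.6 / connector 1.6 style).
// Keyspace "test" and table "kv" are hypothetical placeholders.
def writeToCassandra(df: DataFrame): Unit = {
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "test", "table" -> "kv"))
    .mode("append")
    .save()
}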

I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and the library dependencies:
libraryDependencies ++= Seq(
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
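For reference, the quick start boils down to roughly this round trip (keyspace test and table kv are the ones the guide has you create; the Cassandra host is an assumption):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // brings saveToCassandra and cassandraTable into scope

val conf = new SparkConf()
  .setAppName("quickstart")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra node
val sc = new SparkContext(conf)

// Write two rows, then read the table back and count it.
sc.parallelize(Seq(("key3", 3), ("key4", 4)))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))

println(sc.cassandraTable("test", "kv").count())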

Scala - Error java.lang.NoClassDefFoundError: upickle/core/Types$Writer

I'm new to Scala/Spark, so please go easy on me :)
I'm trying to run an EMR cluster on AWS, running the jar file I packaged with sbt package.
When I run the code locally, it is working perfectly fine, but when I'm running it in the AWS EMR cluster, I'm getting an error:
ERROR Client: Application diagnostics message: User class threw exception: java.lang.NoClassDefFoundError: upickle/core/Types$Writer
From what I understand, this error originates in the Scala/Spark version dependencies.
I'm using Scala 2.12 with Spark 3.0.1, and on AWS I'm using emr-6.2.0.
Here's my build.sbt:
scalaVersion := "2.12.14"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.792"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.792"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1"
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.4.1"
What am I missing?
Thanks!
If you use sbt package, the generated jar will contain only the code of your project, not its dependencies. You need to use sbt assembly to generate a so-called uber-jar that includes the dependencies as well.
But in your case, it's recommended to mark the Spark and Hadoop (and maybe AWS) dependencies as Provided - they should already be included in the EMR runtime. Use something like this:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided

Spark + Scala: How to add external dependencies in build.sbt [duplicate]

This question already has an answer here:
How to define Kafka (data source) dependencies for Spark Streaming?
(1 answer)
Closed 2 years ago.
I'm new to Spark (using v2.4.5) and am still trying to figure out the correct way to add external dependencies. When trying to add Kafka streaming to my project, my build.sbt looked like this:
name := "Stream Handler"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.5" % "provided",
"org.apache.spark" % "spark-streaming_2.11" % "2.4.5" % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.4.5"
)
This builds successfully, but when running with spark-submit, I get a java.lang.NoClassDefFoundError with KafkaUtils.
I was able to get my code working by passing in the dependency through the --packages option like this:
$ spark-submit [other_args] --packages "org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5"
Ideally I would like to set up all the dependencies in the build.sbt, but I'm not sure what I'm doing wrong. Any advice would be appreciated!
your "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.4.5" is wrong.
change that to below like mvnrepo.. https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.5"

Running scala spark 1.6 job on spark 2.1 fails [duplicate]

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I have a Spark job that needs to run nightly. However, I had to update from Spark 1.6 to 2.1, and now I am receiving an error:
java.lang.NoSuchMethodError: org/apache/spark/sql/DataFrameReader.load()Lorg/apache/spark/sql/DataFrame; (loaded from file:/usr/local/src/spark21master/spark-2.1.2-bin-2.7.3/jars/spark-sql_2.11-2.1.2.jar by sun.misc.Launcher$AppClassLoader#305de464) called from class com.ibm.cit.tennis.ServiceStat$ (loaded from file:/tmp/spark-21-ego-master/work/spark-driver-8073f84b-6c09-4d7d-83f5-2c99527eaa1c/spark-service-stat_2.11-1.0.jar by org.apache.spark.util.MutableURLClassLoader#ee80a89b).
In my SBT build file, I have the following configuration:
scalaVersion := "2.11.8"
val sparkVersion = "2.1.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % sparkDependencyScope,
"org.apache.spark" %% "spark-sql" % sparkVersion % sparkDependencyScope,
"org.apache.spark" %% "spark-mllib" % sparkVersion % sparkDependencyScope,
"org.apache.spark" %% "spark-streaming" % sparkVersion % sparkDependencyScope,
"org.apache.spark" %% "spark-hive" % sparkVersion % sparkDependencyScope,
"org.apache.spark" %% "spark-repl" % sparkVersion % sparkDependencyScope
"org.apache.spark" %% "spark-graphx" % sparkVersion % sparkDependencyScope
)
I am building with Scala 2.11.8 and Java 1.8.0.
Any help would be appreciated,
Aaron.
A NoSuchMethodError exception is a sign of a version mismatch. I suspect you're still trying to use Spark 1.6 to launch your app. It is not clear what the value of sparkDependencyScope is in your example. Normally Spark dependencies are specified with the "provided" scope, i.e. they are supplied by the installed Spark runtime.
"org.apache.spark" %% "spark-core" % sparkVersion % "provided"
Try running spark-submit --version to figure out which Spark launcher version is being used. If it is not what you expect, make sure Spark 2.1.2 is installed and on your PATH.
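As an extra sanity check, you can also log the version from inside the job itself (a small sketch; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("version-check").getOrCreate()
// If the intended runtime launched the job, this should print 2.1.2.
println(s"Spark version: ${spark.version}")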
The issue has been solved. The libraries were cached in the environment; after creating a new environment, SBT was able to pull the latest sources.
Also, I had to add:
conf.set("spark.sql.crossJoin.enabled", "true")

override guava dependency version of spark

Spark depends on an old version of Guava.
I build my Spark project with sbt assembly, excluding Spark using "provided" and including the latest version of Guava.
However, when running sbt assembly, the Guava dependency is also excluded from the jar.
My build.sbt:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"com.google.guava" % "guava" % "11.0"
)
If I remove the % "provided", then both Spark and Guava are included.
So, how can I exclude Spark but include Guava?
You are looking for shading options. See here, but basically you need to add shading instructions, something like this:
assemblyShadeRules in assembly := Seq(
// Guava's classes live under com.google.common; rename them into a shaded package
ShadeRule.rename("com.google.common.**" -> "my_conf.@1")
.inLibrary("com.google.guava" % "guava" % "11.0")
.inProject
)
There is also the corresponding maven-shade-plugin for those who prefer Maven.

Compile Spark stream kinesis sample application using SBT tool

I am trying to compile and create a jar for the Spark Kinesis streaming Scala application provided by Spark itself at the given link.
Kinesis word count sample
Following is my sbt file. It has all the dependencies, and it compiles fine for simple programs.
name := "stream parser"
version := "1.0"
val sparkVersion = "2.0.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion % "provided"
)
But the Kinesis sample throws the following error when compiling on my Ubuntu system.
trait Logging in package internal cannot be accessed in package org.apache.spark.internal
[error] object KinesisWordCountASL extends Logging {
[error] ^
The Logging class import:
import org.apache.spark.internal.Logging
Any idea what could be the problem?
According to the Spark repo, in 2.0 and above the Logging class is included in the package org.apache.spark.internal; in previous releases it was in the package org.apache.spark. I'd suggest running your example against Spark 2.0.0 or later, or making changes to use the appropriate libraries.
The same example compiles without any issues with this build.sbt:
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.2",
"org.apache.spark" %% "spark-sql" % "2.0.2",
"org.apache.spark" %% "spark-streaming" % "2.0.2",
"org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.2",
"com.amazonaws" % "amazon-kinesis-producer" % "0.10.2"
)
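If you still hit the Logging access error in your own build, one workaround (not part of the original sample) is to avoid extending Spark's internal Logging trait and use an SLF4J logger directly, for example:

import org.slf4j.{Logger, LoggerFactory}

object KinesisWordCountASL {
  // Plain SLF4J logger instead of extending org.apache.spark.internal.Logging
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    log.info("Starting Kinesis word count")
    // ... rest of the sample unchanged ...
  }
}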