Compile Spark Kinesis streaming sample application using SBT (Scala)

I am trying to compile and create a jar for the Spark Kinesis streaming Scala application provided by Spark itself, linked below.
Kinesis word count sample
Following is my sbt file. It has all the dependencies, and it compiles fine for simple programs.
name := "stream parser"
version := "1.0"
val sparkVersion = "2.0.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kinesis-asl" % sparkVersion % "provided"
)
But the Kinesis sample throws the following error while compiling on my Ubuntu system.
trait Logging in package internal cannot be accessed in package org.apache.spark.internal
[error] object KinesisWordCountASL extends Logging {
[error] ^
Logging class import
import org.apache.spark.internal.Logging
Any idea what the problem could be?

According to the Spark repo, in 2.0 and above the Logging trait lives in the package org.apache.spark.internal; in previous releases it was in the package org.apache.spark. I'd suggest running your example against Spark 2.0.0 or later, or adjusting the imports to match the libraries you are actually using.
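For reference, only the import path changes between the two release lines (a sketch of the package move described above):
// Spark 1.x
import org.apache.spark.Logging
// Spark 2.0 and later
import org.apache.spark.internal.Logging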
The same example compiles without any issues with the following build.sbt:
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.2",
"org.apache.spark" %% "spark-sql" % "2.0.2",
"org.apache.spark" %% "spark-streaming" % "2.0.2",
"org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.0.2",
"com.amazonaws" % "amazon-kinesis-producer" % "0.10.2"
)
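Worth noting: if spark-streaming-kinesis-asl stays at % "provided" (as in the question's build.sbt), it will not be packaged into the assembled jar, and because the Kinesis integration is not shipped with the standard Spark distribution it would then have to be supplied at submit time, along these lines (the main class and jar name are placeholders; match the artifact suffix to your scalaVersion):
spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:2.0.2 --class your.main.Class your-assembly.jar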

Related

Using the Cloudera sparkts library in a project using sbt

I am trying to use the Cloudera sparkts library for time series forecasting, but I am unable to download the library using sbt.
Below is my build.sbt file. I can see that the Maven repo has a 0.4.0 release, so I am not sure what I am doing wrong.
Can anyone please help me figure out what is wrong with my sbt file?
import sbt.complete.Parsers._
scalaVersion := "2.11.8"
name := "Forecast Stock Price using Spark TimeSeries library"
val sparkVersion = "1.5.2"
//resolvers ++= Seq("Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming" % sparkVersion withSources(),
"org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
"org.apache.spark" %% "spark-hive" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion withSources(),
"org.apache.spark" %% "spark-mllib" % sparkVersion withSources(),
"com.databricks" %% "spark-csv" % "1.3.0" withSources(),
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
)
Change
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
to
"com.cloudera.sparkts" % "sparkts" % "0.4.0"
sparkts is only distributed for Scala 2.11; it does not encode the Scala version in the artifact name.
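To spell the difference out (a sketch of how sbt resolves the two forms, given the scalaVersion of 2.11.8 above):
// %% appends the Scala binary version to the artifact name,
// so sbt looks for sparkts_2.11-0.4.0 -- which is not published:
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
// A single % uses the artifact name exactly as written,
// so sbt looks for sparkts-0.4.0 -- the artifact that actually exists:
"com.cloudera.sparkts" % "sparkts" % "0.4.0"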

Spark Streaming Kafka java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.StringDeserializer

I am using Spark Streaming with the Kafka integration. When I run the streaming application from my IDE in local mode, everything works like a charm. However, as soon as I submit it to the cluster I keep getting the following error:
java.lang.ClassNotFoundException:
org.apache.kafka.common.serialization.StringDeserializer
I am using sbt-assembly to build my project.
My build.sbt is as follows:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0" % Provided,
"org.marc4j" % "marc4j" % "2.8.2",
"net.sf.saxon" % "Saxon-HE" % "9.7.0-20"
)
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
mainClass in assembly := Some("EstimatorStreamingApp")
I also tried to use the --packages option.
attempt 1
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0
attempt 2
--packages org.apache.spark:spark-streaming-kafka-0-10-assembly_2.11:2.2.0
All with no success. Does anyone have anything to suggest?
You need to remove the "provided" flag from the Kafka dependency, as it is not provided out of the box with Spark:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0" % Provided,
"org.marc4j" % "marc4j" % "2.8.2",
"net.sf.saxon" % "Saxon-HE" % "9.7.0-20"
)
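For context, the missing StringDeserializer class lives in the kafka-clients jar that spark-streaming-kafka-0-10 pulls in transitively, so that integration has to end up inside the assembly. This is a minimal sketch of where it is typically referenced, assuming an existing StreamingContext named ssc and placeholder broker, group, and topic names:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Consumer configuration; the deserializer classes are exactly what fails to
// load on the cluster when kafka-clients is missing from the assembly.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "estimator-streaming-app"             // placeholder group id
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)  // placeholder topic
)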

Need some help fixing the Spark Streaming dependencies (Scala, sbt)

I am trying to run a basic Spark Streaming example on my machine using IntelliJ, but I am unable to resolve the dependency issues.
Please help me fix them.
name := "demoSpark"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq("org.apache.spark"% "spark-core_2.11"%"2.1.0",
"org.apache.spark" % "spark-sql_2.10" % "2.1.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
At the very least, all the dependencies must use the same version of Scala, not a mix of 2.10 and 2.11. You can use the %% operator in sbt to ensure the right version is selected (the one you specified in scalaVersion):
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-sql" % "2.1.0",
"org.apache.spark" %% "spark-streaming" % "2.1.0",
"org.apache.spark" %% "spark-mllib" % "2.1.0"
)
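Once that resolves, a classic way to sanity-check the streaming dependency is the socket word count below (a minimal sketch; the object name is my own, and the host/port are placeholders, e.g. fed by nc -lk 9999):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DemoSpark {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("demoSpark").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}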

Trying to get Apache Spark working with IntelliJ

I am trying to get Apache Spark working with IntelliJ. I have created an SBT project in IntelliJ and done the following:
1. Gone to File -> Project Structure -> Libraries
2. Clicked the '+' in the middle section, clicked Maven, clicked Download Library from Maven Repository, typed 'spark-core', and selected org.apache.spark:spark-core_2.11:2.2.0, the latest version of Spark available.
The jar files and the source code were downloaded into ./lib in the project folder.
3. The Spark library is now showing in the list of libraries
4. Then I right-clicked on org.apache.spark:spark-core_2.11:2.2.0 and clicked Add to Project and Add to Modules
Now when I click on Modules on the left, then my main project folder, then the Dependencies tab on the right, I can see the external library listed as a Maven library. But after clicking Apply, rebuilding the project, and restarting IntelliJ, it does not show as an external library in the project, so I can't access the Spark API.
What am I doing wrong? I've looked at all the IntelliJ documentation and a hundred other sources but can't find the answer.
Also, do I need to include the following text in the build.sbt file as well as specifying Apache Spark as an external library dependency? I assume that I need to EITHER include the code in the build.sbt file OR add Spark as an external dependency manually, but not both.
I included this code in my build.sbt file:
name := "Spark_example"
version := "1.0"
scalaVersion := "2.12.3"
val sparkVersion = "2.0.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion
)
I get an error: sbt.ResolveException: unresolved dependency: org.apache.spark#spark-core_2.12;2.2.0: not found
Please help! Thanks
Spark does not publish builds for Scala 2.12.x, so set the Scala version to 2.11.x:
scalaVersion := "2.11.8"
val sparkVersion = "2.0.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion
)
name := "Test"
version := "0.1"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0.2.6.4.0-91"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0.2.6.4.0-91"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.2.0.2.6.4.0-91" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.2.0.2.6.4.0-91" % "runtime"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0.2.6.4.0-91" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "2.2.0.2.6.4.0-91" % "provided"

Override the Guava dependency version of Spark

Spark depends on an old version of Guava.
I build my Spark project with sbt-assembly, excluding Spark using "provided" and including the latest version of Guava.
However, when running sbt-assembly, the Guava dependency is also excluded from the jar.
My build.sbt:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"com.google.guava" % "guava" % "11.0"
)
If I remove the % "provided", then both Spark and Guava are included.
So, how can I exclude Spark and include Guava?
You are looking for sbt-assembly's shading options; basically you need to add shading rules, something like this:
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "my_conf.@1")  // Guava classes live under com.google.common
    .inLibrary("com.google.guava" % "guava" % "11.0")
    .inProject
)
There is also the corresponding maven-shade-plugin for those who prefer Maven.
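As a side note on why shading (rather than plain inclusion) is the usual fix here: even if a newer Guava is bundled unrenamed, the Guava version that Spark itself puts on the cluster classpath can still win at runtime; relocating the bundled Guava classes to a different package sidesteps that conflict entirely.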