IntelliJ setup with Spark Scala and Sbt

I am setting up a Spark + Scala + sbt project in IntelliJ.
Scala Version: 2.12.8
SBT Version: 1.4.2
Java Version: 1.8
Build.sbt file:
name := "Spark_Scala_Sbt"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.3",
  "org.apache.spark" %% "spark-sql" % "2.3.3"
)
Scala file:
import org.apache.spark.sql.SparkSession

object FirstSparkApplication extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("Sample App")
    .getOrCreate()

  val data = spark.sparkContext.parallelize(
    Seq("I like Spark", "Spark is awesome", "My first Spark job is working now and is counting down these words")
  )
  val filtered = data.filter(line => line.contains("awesome"))
  filtered.collect().foreach(print)
}
But it is showing the error messages below:
1. Cannot resolve symbol apache
2. Cannot resolve symbol SparkSession
3. Cannot resolve symbol sparkContext
4. Cannot resolve symbol filter
5. Cannot resolve symbol collect
6. Cannot resolve symbol contains
What should I change here?
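A minimal build.sbt sketch, assuming the unresolved symbols come from sbt failing to resolve the _2.12 artifacts (Spark 2.3.3 is published for Scala 2.11 only, so spark-core_2.12:2.3.3 does not exist and the library never reaches the classpath):
name := "Spark_Scala_Sbt"
version := "0.1"

// Spark 2.3.x is published for Scala 2.11 only; _2.12 artifacts start with Spark 2.4.x
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.3",
  "org.apache.spark" %% "spark-sql" % "2.3.3"
)
Alternatively, keep Scala 2.12.8 and move to Spark 2.4.x or later, which does publish _2.12 artifacts; in either case, reimport the sbt project so IntelliJ picks up the resolved jars.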

Related

Spark-Kafka invalid dependency detected

I have some basic Spark-Kafka code and I am trying to run the following:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import java.util.regex.Pattern
import java.util.regex.Matcher
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import Utilities._

object WordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
    setupLogging()
    // Construct a regular expression (regex) to extract fields from raw Apache log lines
    val pattern = apacheLogPattern()
    // hostname:port for Kafka brokers, not Zookeeper
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    // List of topics you want to listen for from Kafka
    val topics = List("testLogs").toSet
    // Create our Kafka stream, which will contain (topic, message) pairs. We tack a
    // map(_._2) at the end in order to only get the messages, which contain individual
    // lines of data.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)
    // Extract the request field from each log line
    val requests = lines.map(x => {val matcher: Matcher = pattern.matcher(x); if (matcher.matches()) matcher.group(5)})
    // Extract the URL from the request
    val urls = requests.map(x => {val arr = x.toString().split(" "); if (arr.size == 3) arr(1) else "[error]"})
    // Reduce by URL over a 5-minute window sliding every second
    val urlCounts = urls.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))
    // Sort and print the results
    val sortedResults = urlCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print()
    // Kick it off
    ssc.checkpoint("/home/")
    ssc.start()
    ssc.awaitTermination()
  }
}
I am using the IntelliJ IDE and created the Scala project using sbt. The build.sbt file is as follows:
name := "Sample"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
  "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
However, when I try to build the code, it produces the following error:
Error:scalac: missing or invalid dependency detected while loading class file 'StreamingContext.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'StreamingContext.class' was compiled against an incompatible version of org.apache.spark.
Error:scalac: missing or invalid dependency detected while loading class file 'DStream.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'DStream.class' was compiled against an incompatible version of org.apache.spark.
When using different Spark libraries together, the versions of all of them should always match.
The version of the Kafka connector you use matters as well; it should be, for example, spark-streaming-kafka-0-10_2.11.
...
scalaVersion := "2.11.8"

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  // note the single % here: the artifact name already carries the Scala suffix
  "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion,
  "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
This is a useful site if you need to check the exact dependencies you should use:
https://search.maven.org/
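Note that moving to the 0-10 connector also changes the consumer API; a rough sketch (reusing ssc from the question; the broker address, group id, and offset setting below are placeholders, not taken from the original code):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Kafka parameters for the new (0.10) consumer API
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker list
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "testLogsGroup",                       // placeholder consumer group
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Set("testLogs")

// createDirectStream now takes location and consumer strategies instead of decoders
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
// Each record is a ConsumerRecord; .value replaces the old map(_._2)
val lines = stream.map(record => record.value)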

Spark Streaming Kafka CreateDirectStream Not Resolving

Need some help, please.
I am using IntelliJ with SBT to build my apps.
I'm working on an app to read a Kafka topic in Spark Streaming in order to do some ETL work on it. Unfortunately, I can't read from Kafka.
The KafkaUtils.createDirectStream isn't resolving and keeps giving me errors (CANNOT RESOLVE SYMBOL). I have done my research and it appears I have the correct dependencies.
Here is my build.sbt:
name := "ASUIStreaming"
version := "0.1"
scalacOptions += "-target:jvm-1.8"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"
libraryDependencies += "org.apache.kafka" %% "kafka-clients" % "0.8.2.1"
libraryDependencies += "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"
Any suggestions? I should also mention I don't have admin access on the laptop since this is a work computer, and I am using a portable JDK and IntelliJ installation. However, my colleagues at work are in the same situation and it works fine for them.
Thanks in advance!
Here is the main Spark Streaming code snippet I'm using.
Note: I've masked some of the confidential work data such as IP and Topic name etc.
import org.apache.kafka.clients.consumer.ConsumerRecord
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark
import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.serialization.StringDeserializer
import scala.util.parsing.json._
import org.apache.spark.streaming.kafka._
object ASUISpeedKafka extends App
{
// Create a new Spark Context
val conf = new SparkConf().setAppName("ASUISpeedKafka").setMaster("local[*]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
//Identify the Kafka Topic and provide the parameters and Topic details
val kafkaTopic = "TOPIC1"
val topicsSet = kafkaTopic.split(",").toSet
val kafkaParams = Map[String, String]
(
"metadata.broker.list" -> "IP1:PORT, IP2:PORT2",
"auto.offset.reset" -> "smallest"
)
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]
(
ssc, kafkaParams, topicsSet
)
}
I was able to resolve the issue. After re-creating the project and adding all the dependencies again, I found out that in IntelliJ certain code has to be on the same line, otherwise it won't compile.
In this case, putting the val kafkaParams arguments on the same line (instead of in a separate block) solved the issue!
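The reason is Scala's semicolon inference: an opening parenthesis on a new line starts a new statement, so the argument list is never applied to Map. A minimal sketch of the working form (using the masked values from the question):
// Keep the opening parenthesis on the same line so the arguments apply to Map
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "IP1:PORT, IP2:PORT2",
  "auto.offset.reset" -> "smallest"
)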

Scala Exception

I am learning Scala to write a driver program for word count in Apache Spark. I am using Windows 7 and the latest Spark version, 2.2.0. While executing the program I get the error mentioned below.
How can I fix it and get the result?
SBT
name := "sample"
version := "0.1"
scalaVersion := "2.12.3"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % sparkVersion,
  "org.apache.spark" % "spark-sql_2.11" % sparkVersion,
  "org.apache.spark" % "spark-streaming_2.11" % sparkVersion
)
Driver Program
package com.demo.file

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SparkSession

object Reader {
  def main(args: Array[String]): Unit = {
    println("Welcome to Reader.")
    val filePath = "C:\\notes.txt"
    val spark = SparkSession.builder.appName("Simple app").config("spark.master", "local").getOrCreate()
    val fileData = spark.read.textFile(filePath).cache()
    val count_a = fileData.filter(line => line.contains("a")).count()
    val count_b = fileData.filter(line => line.contains("b")).count()
    println(s" count of A $count_a and count of B $count_b")
    spark.stop()
  }
}
Error
Welcome to Reader.
Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class
at org.apache.spark.SparkConf$DeprecatedConfig.<init>(SparkConf.scala:723)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:571)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:92)
at org.apache.spark.SparkConf.set(SparkConf.scala:81)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.demo.file.Reader$.main(Reader.scala:11)
at com.demo.file.Reader.main(Reader.scala)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.11.x). Your Scala version is 2.12.x, and that is why it is throwing the exception.
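A minimal build.sbt sketch along those lines, assuming Spark stays at 2.2.0; using %% instead of hard-coding _2.11 lets sbt append the Scala binary suffix that matches scalaVersion:
name := "sample"
version := "0.1"

// Spark 2.2.0 is built against Scala 2.11, so compile the project with 2.11 as well
scalaVersion := "2.11.11"

val sparkVersion = "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)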

How to create Spark/Scala project in IntelliJ IDEA (fails to resolve dependencies in build.sbt)?

I'm trying to build and run a Scala/Spark project in IntelliJ IDEA.
I have added org.apache.spark:spark-sql_2.11:2.0.0 to the global libraries and my build.sbt looks like this:
name := "test"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
I still get an error that says
unknown artifact. unable to resolve or indexed
under spark-sql.
When I tried to build the project, the error was:
Error:(19, 26) not found: type sqlContext, val sqlContext = new sqlContext(sc)
I have no idea what the problem could be. How to create a Spark/Scala project in IntelliJ IDEA?
Update:
Following the suggestions, I updated the code to use SparkSession, but it is still unable to read a CSV file. What am I doing wrong here? Thank you!
val spark = SparkSession
  .builder()
  .appName("Spark example")
  .config("spark.some.config.option", "some value")
  .getOrCreate()

import spark.implicits._

val testdf = spark.read.csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf.show() // it doesn't show anything
// pdf.select("DATE_KEY").show()
The sql in SQLContext should be in uppercase letters, as below:
val sqlContext = new SQLContext(sc)
SQLContext is deprecated in newer versions of Spark, so I would suggest you use SparkSession:
val spark = SparkSession.builder().appName("testings").getOrCreate
val sqlContext = spark.sqlContext
If you want to set the master through your code instead of from the spark-submit command, then you can set .master as well (you can set configs too):
val spark = SparkSession.builder().appName("testings").master("local").config("configuration key", "configuration value").getOrCreate
val sqlContext = spark.sqlContext
Update
Looking at your sample data
DATE|PID|TYPE
8/03/2017|10199786|O
and testing your code
val testdf = spark.read.csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf.show()
I got the following output:
+--------------------+
| _c0|
+--------------------+
| DATE|PID|TYPE|
|8/03/2017|10199786|O|
+--------------------+
Now, adding .option calls for the delimiter and header:
val testdf2 = spark.read.option("delimiter", "|").option("header", true).csv("/Users/H/Desktop/S_CR_IP_H.dat")
testdf2.show()
Output was
+---------+--------+----+
| DATE| PID|TYPE|
+---------+--------+----+
|8/03/2017|10199786| O|
+---------+--------+----+
Note: I have used .master("local") for SparkSession object
(That should really be part of the Spark official documentation)
Replace the following in your build.sbt configuration:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
with the following:
// the latest Scala version that is compatible with Spark
scalaVersion := "2.11.11"
// Few changes here
// 1. Use double %% so you don't have to worry about Scala version
// 2. I doubt you need spark-core dependency
// 3. Use the latest Spark version
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
Don't worry about IntelliJ IDEA telling you the following:
unknown artifact. unable to resolve or indexed
It's just something you have to live with and the only solution I could find is to...accept the annoyance.
val sqlContext = new sqlContext(sc)
The real type is SQLContext, but as the scaladoc says:
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
Please use SparkSession instead.
The entry point to programming Spark with the Dataset and DataFrame API.
See the Spark official documentation to read on SparkSession and other goodies. Start from Getting Started. Have fun!

Run Scala Spark with SBT

The code below causes Spark to become unresponsive:
System.setProperty("hadoop.home.dir", "H:\\winutils")

val sparkConf = new SparkConf().setAppName("GroupBy Test").setMaster("local[1]")
val sc = new SparkContext(sparkConf)

def main(args: Array[String]) {
  val text_file = sc.textFile("h:\\data\\details.txt")
  val counts = text_file
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  println(counts)
}
I'm setting hadoop.home.dir in order to avoid the error mentioned here: Failed to locate the winutils binary in the hadoop binary path
This is what my build.sbt file looks like:
lazy val root = (project in file(".")).
  settings(
    name := "hello",
    version := "1.0",
    scalaVersion := "2.11.0"
  )

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "1.6.0"
)
Should Scala Spark be compilable/runnable using the sbt code in the file?
I think the code is fine; it was taken verbatim from http://spark.apache.org/examples.html, but I am not sure whether the Hadoop winutils path is required.
Update: "The solution was to use fork := true in the main build.sbt"
Here is the reference: Spark: ClassNotFoundException when running hello world example in scala 2.11
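For reference, that is a one-line change in build.sbt; a minimal sketch (the javaOptions line is an optional extra, not part of the original fix):
// Run the application in a forked JVM instead of inside sbt's own JVM
fork := true

// Optionally give the forked JVM more memory for Spark
javaOptions ++= Seq("-Xmx2G")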
This is the content of my build.sbt. Note that if your internet connection is slow, it might take some time.
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-mllib" % "1.6.1",
  "org.apache.spark" %% "spark-sql" % "1.6.1",
  "org.slf4j" % "slf4j-api" % "1.7.12"
)
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
In the main method I added this; however, it depends on where you placed the winutil folder.
System.setProperty("hadoop.home.dir", "c:\\winutil")