I am trying to test a streaming pipeline with Testcontainers as an integration test, but I don't know how to get the bootstrap servers (at least in the latest Testcontainers version) and create a specific topic there. How can I use 'containerDef' to extract the bootstrap servers and add a topic?
import com.dimafeng.testcontainers.{ContainerDef, KafkaContainer}
import com.dimafeng.testcontainers.scalatest.TestContainerForAll
import munit.FunSuite
import org.apache.spark.sql.SparkSession

class Mykafkatest extends FunSuite with TestContainerForAll {

  //val kafkaContainer: KafkaContainer = KafkaContainer("confluentinc/cp-kafka:5.4.3")
  override val containerDef: ContainerDef = KafkaContainer.Def()

  test("do something")(withContainers { container =>
    val sparkSession: SparkSession = SparkSession
      .builder()
      .master("local[*]")
      .appName("Unit testing")
      .getOrCreate()

    // How do I add a topic in that container?

    // This is not possible:
    val servers = container.bootstrapServers

    val df = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", servers)
      .option("subscribe", "topic1")
      .load()

    df.show(false)
  })
}
My sbt configuration:
lazy val root = project
  .in(file("./pipeline"))
  .settings(
    organization := "org.example",
    name := "spark-stream",
    version := "0.1",
    scalaVersion := "2.12.10",
    libraryDependencies := Seq(
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.3" % Compile,
      "org.apache.spark" %% "spark-sql" % "3.0.3" % Compile,
      "com.dimafeng" %% "testcontainers-scala-munit" % "0.39.5" % Test,
      "com.dimafeng" %% "testcontainers-scala-kafka" % "0.39.5" % Test,
      "org.scalameta" %% "munit" % "0.7.28" % Test
    ),
    testFrameworks += new TestFramework("munit.Framework"),
    Test / fork := true
  )
The documentation does not show a complete example: https://www.testcontainers.org/modules/kafka/
The only problem here is that you are explicitly ascribing the type ContainerDef to that KafkaContainer.Def().
The type of the container provided by withContainers (the Containers type) is determined by a path-dependent type on the provided containerDef:
trait TestContainerForAll extends TestContainersForAll { self: Suite =>

  val containerDef: ContainerDef

  final override type Containers = containerDef.Container

  override def startContainers(): containerDef.Container = {
    containerDef.start()
  }

  // inherited from TestContainersSuite
  def withContainers[A](runTest: Containers => A): A = {
    val c = startedContainers.getOrElse(throw IllegalWithContainersCall())
    runTest(c)
  }
}

trait ContainerDef {

  type Container <: Startable with Stoppable

  protected def createContainer(): Container

  def start(): Container = {
    val container = createContainer()
    container.start()
    container
  }
}
The moment you explicitly ascribe the type ContainerDef in override val containerDef: ContainerDef = KafkaContainer.Def(), you break this whole "type trickery", and the Scala compiler is left with a type Container <: Startable with Stoppable instead of a KafkaContainer.
So, just remove that explicit ContainerDef type annotation, and val servers = container.bootstrapServers will work as expected.
import com.dimafeng.testcontainers.KafkaContainer
import com.dimafeng.testcontainers.munit.TestContainerForAll
import munit.FunSuite

class Mykafkatest extends FunSuite with TestContainerForAll {

  override val containerDef = KafkaContainer.Def()

  test("do something")(withContainers { container =>
    //...
    val servers = container.bootstrapServers
    println(servers)
    //...
  })
}
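As for creating a specific topic: the Confluent broker image usually has automatic topic creation enabled, so subscribing may already be enough, but if you want to create it explicitly you can point Kafka's AdminClient at the same bootstrap servers. A minimal sketch, assuming kafka-clients is available on the test classpath (the spark-sql-kafka-0-10 dependency already pulls it in); the createTopic helper name is just for illustration:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

def createTopic(bootstrapServers: String, topic: String): Unit = {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
  val admin = AdminClient.create(props)
  try {
    // a single partition with replication factor 1 is enough for a one-broker test container
    admin.createTopics(Collections.singletonList(new NewTopic(topic, 1, 1.toShort))).all().get()
  } finally {
    admin.close()
  }
}

Calling createTopic(container.bootstrapServers, "topic1") inside withContainers before starting the streaming query should do it.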
I have tried to write a Spark MemoryStream unit test case, and SharedSparkSession is not importing in my test program.
import org.apache.spark.sql.test.SharedSparkSession

class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {
  ....
}
My build.sbt file configuration is below:
scalaVersion := "2.12.0"

val sparkVersion = "3.0.0"

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.5" % "test"
libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
Do I need to add any other dependency artifacts, or is a ScalaTest version change required?
The program below gets an import error for SharedSparkSession.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.test.SharedSparkSession

class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {

  test("spark structured streaming can read from memory socket") {
    // We can import sql implicits
    implicit val sqlCtx = sparkSession.sqlContext
    import sqlImplicits._

    val events = MemoryStream[String]
    val queryName: String = "calleventaggs"

    // Add events to MemoryStream as if they came from Kafka
    val batch = Seq(
      "this is a value to read",
      "and this is another value"
    )
    val currentOffset = events.addData(batch)

    val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
    streamingQuery.processAllAvailable()
    events.commit(currentOffset.asInstanceOf[LongOffset])

    val result: DataFrame = sparkSession.table(queryName)
    result.show

    streamingQuery.awaitTermination(1000L)

    assertResult(batch.size)(result.count)

    val values = result.take(2)
    assertResult(batch(0))(values(0).getString(0))
    assertResult(batch(1))(values(1).getString(0))
  }
}
SharedSparkSession is an internal test utility of the Apache Spark project and is not accessible through the packages you have provided in your sbt file.
The ScalaDocs do not mention SharedSparkSession.
You will see that the trait SharedSparkSession extends SQLTestUtils, which is another testing utility.
For your unit tests it is usually sufficient to just create a local SparkSession.
See the working code below.
import module.JsValueToString
import org.apache.log4j.{Level, Logger}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions.{col, concat, current_timestamp, date_format, from_json, from_unixtime, from_utc_timestamp, lit, regexp_replace, sha2, struct, to_json, to_utc_timestamp, udf}
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.BeforeAndAfterAll
import scala.io.Source

class KafkaProducerFlattenerTestCase extends AnyFunSuite with BeforeAndAfterAll {

  Logger.getLogger("org").setLevel(Level.ERROR)

  @transient var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession
      .builder()
      .appName("KafkaProducerFlattenerTestCase")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
  }

  override def afterAll(): Unit = {
    spark.stop()
  }

  test("MemoryStream testcase for Flattener JSON") {
    implicit val sparkSession: SparkSession = spark
    implicit val ctx = spark.sqlContext
    import sparkSession.implicits._

    val input = MemoryStream[String]
    val valueDf = input.toDS().selectExpr("CAST(value AS STRING)")
    val df2 = valueDf.select(to_json(struct(col("*"))).alias("content"))
    df2.printSchema()
    print(" Before Write Stream")

    val formatName = "memory"
    val query = df2.writeStream
      .queryName("testCustomSinkBasic")
      .format(formatName)
      .start()

    val jsonContent = readJson()
    input.addData(jsonContent)

    assert(query.isActive === true)
    query.processAllAvailable()
    assert(query.exception === None)
    println("query....... " + query.runId)

    val eventName = spark.sql("select * from testCustomSinkBasic")
    val actualValString = JsValueToString(eventName)
    println("actualValString..... " + actualValString)

    assert(actualValString === expectValue())
  }

  def readJson(): String = {
    val fileContents = Source.fromFile("src/resources/Json.txt").getLines().mkString
    fileContents
  }

  def expectValue(): String = {
    val expectVal = """{"publishTime":"123","name[0].lastname":"def","name[0].fname":"abc","name[1].lastname":"jkl","name[1].fname":"ghi","lpid":"1234"}"""
    expectVal
  }
}
The class expected to be covered by the test:
import com.usb.transformation.JsFlattener
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, struct, to_json}
import play.api.libs.json.{JsObject, JsValue, Json}

object JsValueToString extends Serializable {

  var df3: String = null

  def apply(eventName: DataFrame): String = {
    eventName.foreach(x => {
      val content = x.getAs[String]("content").replace("\\", "")
      val subStr = content.substring(10, content.length() - 2)
      println("content ...." + content)
      println("subString " + subStr)
      val str2Json: JsValue = Json.parse(subStr)
      df3 = JsFlattener(str2Json).as[JsObject].toString
      println("df3 Value......" + df3)
    })
    df3
  }
}
In this project, I'm trying to consume data from a Kafka topic using Flink and then process the stream to detect a pattern using Flink CEP.
The Kafka connector part works and data is being fetched, but the CEP part doesn't work for some reason.
I'm using Scala in this project.
build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.12.2"
libraryDependencies += "org.apache.kafka" %% "kafka" % "2.3.0"
libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % "1.12.2"
libraryDependencies += "org.apache.flink" %% "flink-cep-scala" % "1.12.2"
the main code:
import org.apache.flink.api.common.serialization.SimpleStringSchema
import java.util
import java.util.Properties
import org.apache.flink.cep.PatternSelectFunction
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.scala._
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.cep.pattern.conditions.IterativeCondition

object flinkExample {
  def main(args: Array[String]): Unit = {

    val CLOSE_THRESHOLD: Double = 140.00

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("zookeeper.connect", "localhost:2181")
    properties.setProperty("group.id", "test")

    val consumer = new FlinkKafkaConsumer[String]("test", new SimpleStringSchema(), properties)
    consumer.setStartFromEarliest

    val see: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    val src: DataStream[String] = see.addSource(consumer)

    val keyedStream: DataStream[Stock] = src.map(v => v)
      .map {
        v =>
          val data = v.split(":")
          val date = data(0)
          val close = data(1).toDouble
          Stock(date, close)
      }

    val pat = Pattern
      .begin[Stock]("start")
      .where(_.Adj_Close > CLOSE_THRESHOLD)

    val patternStream = CEP.pattern(keyedStream, pat)

    val result = patternStream.select(
      patternSelectFunction = new PatternSelectFunction[Stock, String]() {
        override def select(pattern: util.Map[String, util.List[Stock]]): String = {
          val data = pattern.get("first").get(0)
          data.toString
        }
      }
    )

    result.print()

    see.execute("ASK Flink Kafka")
  }

  case class Stock(date: String, Adj_Close: Double) {
    override def toString: String = s"Stock date: $date, Adj Close: $Adj_Close"
  }
}
Data coming from Kafka are in string format: "date:value"
Scala version: 2.11.12
Flink version: 1.12.2
Kafka version: 2.3.0
I'm building the project using: sbt assembly, and then deploy the jar in the flink dashboard.
With pattern.get("first") you are selecting a pattern named "first" from the pattern sequence, but the pattern sequence only has one pattern, which is named "start". Try changing "first" to "start".
Also, CEP has to be able to sort the stream into temporal order in order to do pattern matching, so you should define a watermark strategy. For processing-time semantics you can use WatermarkStrategy.noWatermarks().
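Applied to the code above, a minimal sketch of both changes (assuming Flink 1.12.x, matching the question's build.sbt):

import org.apache.flink.api.common.eventtime.WatermarkStrategy

// processing-time semantics: a no-op watermark strategy is enough
val withWatermarks: DataStream[Stock] =
  keyedStream.assignTimestampsAndWatermarks(WatermarkStrategy.noWatermarks[Stock]())

val patternStream = CEP.pattern(withWatermarks, pat)

val result = patternStream.select(
  new PatternSelectFunction[Stock, String]() {
    override def select(pattern: util.Map[String, util.List[Stock]]): String =
      pattern.get("start").get(0).toString // "start" is the name used in Pattern.begin
  }
)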
I have written a Spark job to read one file, convert the data to JSON, and post the data to Kafka.
I tried all the options, like:
1. Putting a Thread.sleep.
2. Changing linger.ms to less than the Thread.sleep. Nothing is working out; it just does not post anything to Kafka. I have also tried producer.flush() / producer.close(); no error appears in the log, but it still just does not post anything.
3. If I write a plain standalone producer to post a message to the same Kafka topic, it goes through without any issue, hence there is no issue with Kafka as such.
4. I can see from the log that my send method is getting called, and at the end close is getting called. No error.
Please help!
Here are the important files of the project:
build.sbt:
name := "SparkStreamingExample"

//version := "0.1"
scalaVersion := "2.11.8"

val spark = "2.3.1"
val kafka = "0.10.1"

// https://mvnrepository.com/artifact/org.apache.kafka/kafka
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.9.6"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.6"
dependencyOverrides += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.6"

// https://mvnrepository.com/artifact/com.fasterxml.jackson.dataformat/jackson-dataformat-cbor
dependencyOverrides += "com.fasterxml.jackson.dataformat" % "jackson-dataformat-cbor" % "2.9.6"

libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "2.0.0"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % spark
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "com.typesafe.play" % "play-json_2.11" % "2.6.6" exclude("com.fasterxml.jackson.core", "jackson-databind")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
libraryDependencies += "com.typesafe" % "config" % "1.3.2"
MySparkKafkaProducer.scala
import java.util.Properties
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

class MySparkKafkaProducer(createProducer: () => KafkaProducer[String, String]) extends Serializable {

  /* This is the key idea that allows us to work around running into
     NotSerializableExceptions. */
  @transient lazy val producer = createProducer()

  def send(topic: String, key: String, value: String): Future[RecordMetadata] = {
    println("inside send method")
    producer.send(new ProducerRecord(topic, key, value))
  }

  def send(topic: String, value: String) = {
    // println("inside send method")
    producer.send(new ProducerRecord(topic, value))
  }

  // producer.send(new ProducerRecord[K, V](topic, value))
}

object MySparkKafkaProducer extends Serializable {

  import scala.collection.JavaConversions._

  def apply(config: Properties): MySparkKafkaProducer = {
    val f = () => {
      val producer = new KafkaProducer[String, String](config)
      sys.addShutdownHook({
        println("calling Closeeeeeeeeeee")
        producer.flush()
        producer.close
      })
      producer
    }
    new MySparkKafkaProducer(f)
  }
}
AlibababaMainJob.scala
import java.util.Properties
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.codehaus.jackson.map.ser.std.StringSerializer

object AlibababaMainJob {

  def main(args: Array[String]) {
    val ss = SparkSession.builder().master("local[*]").appName("AlibabaCoreJob").getOrCreate()
    val conf = new SparkConf().setMaster("local[2]").setAppName("AlibabaCoreJob")
    //val ssc = new StreamingContext(conf, Seconds(1))
    // val ssc= new StreamingContext(getSparkConf(),6)
    val coreJob = new AlibabaCoreJob()
    val configuration = Configuration.apply(ConfigFactory.load.resolve)
    implicit val rollUpProducer: Broadcast[MySparkKafkaProducer] = ss.sparkContext.broadcast(MySparkKafkaProducer(producerProps(configuration)))
    println(s"==========Kafka Config======${configuration.kafka}")
    coreJob.processRecordFromFile(ss, rollUpProducer)
    Thread.sleep(1000)
    //ssc.start()
    // println(s"==========Spark context Sarted ]======${ssc.sparkContext.appName}")
    /// ssc.awaitTermination()
    //
    //val ss = SparkSession.builder().master("local[*]").appName("AlibabaCoreJob").getOrCreate()
    //Set Up kafka Configuration: https://stackoverflow.com/questions/31590592/spark-streaming-read-and-write-on-kafka-topic
  }

  def producerProps(jobConfig: Configuration, extras: (String, String)*): Properties = {
    val p = new Properties()
    p.put("bootstrap.servers", jobConfig.kafka.brokerList)
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    p.put("acks", "all")
    p.put("retries", "3")
    p.put("linger.ms", "1")
    p
  }

  // coreJob.processRecordFromFile(ss,rollUpProducer)
  //}
}
AlibabaCoreJob.scala
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import play.api.libs.json._
import org.apache.kafka.clients.producer.ProducerConfig

class AlibabaCoreJob extends Serializable {

  // implicit val transctionWrites = Json.writes[Transction]
  //case class Transction(productCode:String,description:String,brand:String,category:String,unitPrice:String,ordItems:String,mode:String) extends Serializable

  def processRecordFromFile(ss: SparkSession, kafkaProducer: Broadcast[MySparkKafkaProducer]): Unit = {
    println("Entering processRecordFromFile")
    val rddFromFile = ss.sparkContext.textFile("src/main/resources/12_transactions_case_study.csv")
    println("Entering loaded file")
    val fileAfterHeader = rddFromFile.mapPartitionsWithIndex(
      (idx, iterator) => if (idx == 0) iterator.drop(0) else iterator)
    println("Removed header")
    processRdd(fileAfterHeader, kafkaProducer: Broadcast[MySparkKafkaProducer])
  }

  //Set Up kafka Configuration: https://stackoverflow.com/questions/31590592/spark-streaming-read-and-write-on-kafka-topic
  def processRdd(fileAfterHeader: RDD[String], kafkaProducer: Broadcast[MySparkKafkaProducer]) = {
    println("Entering processRdd")
    val rddList = fileAfterHeader.mapPartitions(
      line => {
        // println("lineeeeee>>>"+line)
        line.map(x => x.split(",")).map(y => Transction(y(0), y(1), y(2), y(3), y(4), y(5), y(6))).toList.toIterator
      })
    rddList.foreach(lineitem => {
      // println("Entering foreach>>>>")
      val jsonString: String = Json.stringify(Json.toJson(lineitem))
      //val jsonString:String=lineitem.A
      // println("calling kafka producer")
      kafkaProducer.value.send("topic", jsonString)
      // println("done calling kafka producer")
    })
  }

  //Suppose you want to drop the 1st 3 lines from the file
  // val da = fi.mapPartitionsWithIndex{ (id_x, iter) => if (id_x == 0) iter.drop(3) else iter }
  //Create RowRDD by mapping each line to the required fields
  // val rowRdd = da.map(x=>Row(x(0), x(1)))
  //Map Partitions:
}
In your project, add the following dependencies: spark-sql, spark-core, spark-streaming, and spark-sql-kafka-0-10.
You can then read the given file into a DataFrame, perform any sort of processing that you want, and when your processing is finished, write the DataFrame to Kafka as follows:
resultDF.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/checkpoint") // the Kafka sink requires a checkpoint location
  .start()
You can refer to the doc here for further clarity.
Note that I have assumed your results of processing would be stored in a DataFrame called resultDF.
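Since the question reads a static CSV file rather than a stream, the batch write API may be the closer fit. A minimal sketch under that assumption, reusing the file path from the question (the bootstrap servers and topic name are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("FileToKafka").getOrCreate()

// The Kafka sink expects a string (or binary) column named "value".
val resultDF = spark.read
  .option("header", "true")
  .csv("src/main/resources/12_transactions_case_study.csv") // path taken from the question
  .select(to_json(struct(col("*"))).alias("value"))

resultDF.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()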
I suppose some dependencies are not defined in the build.sbt file.
I've added library dependencies to the build.sbt file, but I'm still getting the error mentioned in the title of this question. I tried to search for a solution on Google but couldn't find one.
My Spark Scala source code (filterEventId100.scala):
package com.projects.setTopBoxDataAnalysis

import java.lang.System._
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.sql.SparkSession

object filterEventId100 extends App {

  if (args.length < 2) {
    println("Usage: JavaWordCount <Input-File> <Output-file>")
    exit(1)
  }

  val spark = SparkSession
    .builder
    .appName("FilterEvent100")
    .getOrCreate()

  val data = spark.read.textFile(args(0)).rdd

  val result = data.flatMap{line: String => line.split("\n")}
    .map{serverData =>
      val serverDataArray = serverData.replace("^", "::").split("::")
      val evenId = serverDataArray(2)
      if (evenId.equals("100")) {
        val serverId = serverDataArray(0)
        val timestempTo = serverDataArray(3)
        val timestempFrom = serverDataArray(6)
        val server = new Servers(serverId, timestempFrom, timestempTo)
        val res = (serverId, server.dateDiff(server.timestampFrom, server.timestampTo))
        res
      }
    }.reduceByKey{
      case (x: Long, y: Long) => if ((x, y) != null) {
        if (x > y) x else y
      }
    }

  result.saveAsTextFile(args(1))

  spark.stop
}

class Servers(val serverId: String, val timestampFrom: String, val timestampTo: String) {

  val DATE_FORMAT = "yyyy-MM-dd hh:mm:ss.SSS"

  private def convertStringToDate(s: String): Date = {
    val dateFormat = new SimpleDateFormat(DATE_FORMAT)
    dateFormat.parse(s)
  }

  private def convertDateStringToLong(dateAsString: String): Long = {
    convertStringToDate(dateAsString).getTime
  }

  def dateDiff(tFrom: String, tTo: String): Long = {
    val dDiff = convertDateStringToLong(tTo) - tFrom.toLong
    dDiff
  }
}
My build.sbt file:
name := "SetTopProject"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.hadoop" %% "hadoop-common" % "3.2.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-sql_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-hive_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
"org.apache.spark" %% "spark-yarn_2.12" % "2.4.3" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy")
)
I was expecting everything to be fine because
val spark = SparkSession
  .builder
  .appName("FilterEvent100")
  .getOrCreate()
is defined well (without any compiler errors), and I use the spark value to define the data value:
val data = spark.read.textFile(args(0)).rdd
on which I call the reduceByKey and saveAsTextFile functions:

val result = data.flatMap{line: String => line.split("\n")}...
  }.reduceByKey { case (x: Long, y: Long) =>
    if ((x, y) != null) {
      if (x > y) x else y
    }
  }
result.saveAsTextFile(args(1))

What should I do to remove the compiler errors for the saveAsTextFile and reduceByKey function calls?
Replace
val spark = SparkSession
  .builder
  .appName("FilterEvent100")
  .getOrCreate()

val data = spark.read.textFile(args(0)).rdd

with

val conf = new SparkConf().setAppName("FilterEvent100")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()

val data = sc.textFile(args(0))
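For this replacement to compile, SparkConf and SparkContext also need to be in scope:

import org.apache.spark.{SparkConf, SparkContext}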
I cannot access SparkConf in the package, even though I have already imported org.apache.spark.SparkConf. My code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object SparkStreaming {
  def main(arg: Array[String]) = {

    val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    val words = lines.flatMap(_.split(" "))
    val pairs_new = words.map(w => (w, 1))
    val wordsCount = pairs_new.reduceByKey(_ + _)
    wordsCount.print()

    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
The sbt dependencies are:
name := "Spark Streaming"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.5.2",
"org.apache.spark" %% "spark-streaming" % "1.5.2"
)
But the error shows that SparkConf cannot be accessed.
[error] /home/cliu/Documents/github/Spark-Streaming/src/main/scala/Spark-Streaming.scala:31: object SparkConf in package spark cannot be accessed in package org.apache.spark
[error] val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
[error] ^
It compiles if you add parentheses after SparkConf:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
The point is that SparkConf is a class and not a function, so the class name can also be used for scoping purposes. When you add parentheses after the class name, you make sure you are calling the class constructor and not using the name for scoping. Here is an example from the Scala shell illustrating the difference:
scala> class C1 { var age = 0; def setAge(a:Int) = {age = a}}
defined class C1

scala> new C1
res18: C1 = $iwC$$iwC$C1@2d33c200

scala> new C1()
res19: C1 = $iwC$$iwC$C1@30822879

scala> new C1.setAge(30) // this doesn't work
<console>:23: error: not found: value C1
              new C1.setAge(30)
                  ^

scala> new C1().setAge(30) // this works

scala>
In this case you cannot omit parentheses so it should be:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")