How to turn this simple Spark Streaming code into a multi-threaded one? - scala

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer execution per partition whilst streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    .setMaster("local[*]")
    .set("spark.executor.memory", "1g")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val topicsSet = topics.split(",")
  val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
    ConsumerConfig.GROUP_ID_CONFIG -> group_name,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    "auto.offset.reset" -> "earliest")

  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  val lines = messages.map(_.value)
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()
}

Assuming your input topic has multiple partitions, then additionally setting local[*] means you'll have one worker thread per CPU core, and each Kafka partition maps to its own Spark partition, so at least one partition can be consumed by each thread in parallel.
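To see that in action, here is a minimal sketch (untested, reusing the messages stream defined above) that logs how many partitions each micro-batch has:
messages.foreachRDD { rdd =>
  // each Kafka partition of the topic becomes one partition of the batch RDD
  println(s"Kafka partitions consumed in this batch: ${rdd.getNumPartitions}")
  rdd.foreachPartition { records =>
    // one task (a separate thread in local mode) handles each partition
    records.foreach(_ => ())
  }
}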

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to a Cassandra database.
The producer reads the data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then sends the result of the calculations to the table.
My producer looks like this:
object ProducerPlayground extends App {
  val topicName = "test"

  private def createProducer: Properties = {
    val producerProperties = new Properties()
    producerProperties.setProperty(
      ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
      "localhost:9092"
    )
    producerProperties.setProperty(
      ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      classOf[IntegerSerializer].getName
    )
    producerProperties.setProperty(
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      classOf[StringSerializer].getName
    )
    producerProperties
  }

  val producer = new KafkaProducer[Int, String](createProducer)
  val source = Source.fromFile("G:\\text.txt", "UTF-8")
  val lines = source.getLines()

  var key = 0
  for (line <- lines) {
    producer.send(new ProducerRecord[Int, String](topicName, key, line))
    key += 1
  }

  source.close()
  producer.flush()
}
Consumer looks like this:
object BatchLayer {
  def main(args: Array[String]) {
    val brokers = "localhost:9092"
    val topics = "test"
    val groupId = "groupId-1"

    val sparkConf = new SparkConf()
      .setAppName("BatchLayer")
      .setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val sc = ssc.sparkContext
    sc.setLogLevel("OFF")

    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
    )

    val stream =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
      )

    val cass = CassandraConnector(sparkConf)
    cass.withSessionDo { session =>
      session.execute(
        s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
      )
      session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
      session.execute(s"TRUNCATE batch_layer.test")
    }

    stream
      .map(v => v.value())
      .flatMap(x => x.split(" "))
      .filter(x => !x.contains(Array('\n', '\t')))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}
After starting the producer, the program stops with the error shown in the title. What did I do wrong?
It makes very little sense to use legacy Spark Streaming in 2021 - it's very cumbersome to use, and you also need to track Kafka offsets yourself, etc. It's better to use Structured Streaming instead - it will track offsets for you through checkpoints, you will work with the high-level Dataset APIs, etc.
In your case the code could look like the following (not tested, but adapted from this working example):
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
  .selectExpr("split(value, '\\\\W+', -1) as words")
  .selectExpr("explode(words) as word")
  .filter("word != ''")
  .groupBy($"word")
  .count()
  .select($"word", $"count")

// create table ...

val query = wordsCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "path_to_checkpoint")
  .option("keyspace", "test")
  .option("table", "<table_name>")
  .start()

query.awaitTermination()
P.S. In your example, the most probable error is that you're trying to use .saveToCassandra directly on a DStream - it doesn't work that way.
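If you do want to stay on the DStream API from the question, a rough, untested sketch of writing the aggregated counts via foreachRDD (so that saveToCassandra is invoked on an RDD) could look like this; it assumes the spark-cassandra-connector is on the classpath and the batch_layer.test table already exists, and it filters out empty words because Cassandra rejects empty partition keys:
import com.datastax.spark.connector._ // brings saveToCassandra for RDDs and SomeColumns

stream
  .map(record => record.value())
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty) // an empty string as key triggers "Key may not be empty"
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .foreachRDD { rdd =>
    rdd.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
  }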

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

I am trying to capture Kafka events (which I am getting in serialised form) using Spark Streaming in Scala.
Here is my code-snippet:
val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")

val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val topics = Set("<topic-name>")
val brokers = "<some-list>"
val groupId = "spark-streaming-test"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "auto.offset.reset" -> "earliest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

messages.foreachRDD { rdd =>
  println(rdd.toDF())
}

ssc.start()
ssc.awaitTermination()
I am getting this error message:
Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]] println(rdd.toDF())
toDF comes through DatasetHolder:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits
I haven't replicated it, but my guess is that there's no Encoder for ConsumerRecord[String, String], so you can either provide one or map it first to something for which an Encoder can be derived (a case class or a primitive).
Also, println within foreachRDD will probably not act the way you want due to the distributed nature of Spark.
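As a rough, untested illustration of that suggestion, mapping each ConsumerRecord to a plain tuple lets the implicits already imported in the snippet derive an Encoder:
messages.foreachRDD { rdd =>
  // map to (key, value) so an implicit Encoder can be derived, then show() on the driver
  val df = rdd.map(record => (record.key, record.value)).toDF("key", "value")
  df.show(truncate = false)
}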

Inconsistent fetch by spark kafka consumer

I have written code to fetch records from Kafka into Spark, and I have come across some strange behaviour: it is consuming records inconsistently.
val conf = new SparkConf()
  .setAppName("Test Data")
  .set("spark.cassandra.connection.host", "192.168.0.40")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "3")
  .set("spark.cores.max", "12")
  .set("spark.driver.cores", "4")
  .set("spark.ui.port", "4040")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "30")
  .set("spark.local.dir", "//tmp//")
  .set("spark.sql.warehouse.dir", "/tmp/hive/")
  .set("hive.exec.scratchdir", "/tmp/hive2")

val spark = SparkSession
  .builder
  .appName("Test Data")
  .config(conf)
  .getOrCreate()
import spark.implicits._

val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))

val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "192.168.0.40:2181",
  "group.id" -> "=groups",
  "auto.offset.reset" -> "smallest")

val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)

kafkaStream.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    try {
      println("Count of rows " + rdd.count())
    } catch {
      case e: Exception => e.printStackTrace
    }
  } else {
    println("blank rdd")
  }
})
So, initially I produced 10 million records into Kafka. The producer is now stopped, and then I started the Spark consumer application. I checked the Spark UI: initially I received 700,000-900,000 records per batch (every 10 seconds) per stream, but afterwards it started getting only 4-6K records per batch. I want to understand why the fetch count fell so badly even though the data is present in Kafka; rather than getting 4K per batch, I am open to the consumer directly fetching much bigger batches. What can be done, and how?
Thanks,

Spark Kafka Streaming multi partition CommitAsync issue

I am reading messages from a Kafka topic which has multiple partitions. Reading the messages is not a problem, but while committing the offset ranges to Kafka I am getting an error. I tried my level best and was not able to resolve this issue.
Code
object ParallelStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkHelper.getOrCreateSparkSession()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    spark.sparkContext.setLogLevel("WARN")

    val kafkaStream = {
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "welcome3",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )

      val topics = Array("test2")
      val numPartitionsOfInputTopic = 2
      val streams = (1 to numPartitionsOfInputTopic) map { _ =>
        KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))
      }
      streams
    }

    // var offsetRanges = Array[OffsetRange]()

    kafkaStream.foreach(rdd => {
      rdd.foreachRDD(conRec => {
        val offsetRanges = conRec.asInstanceOf[HasOffsetRanges].offsetRanges
        conRec.foreach(str => {
          println(str.value())
          for (o <- offsetRanges) {
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
        })
        kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      })
    })

    println(" Spark parallel reader is ready !!!")

    ssc.start()
    ssc.awaitTermination()
  }
}
Error
18/03/19 21:21:30 ERROR JobScheduler: Error running job streaming job 1521512490000 ms.0
java.lang.ClassCastException: scala.collection.immutable.Vector cannot be cast to org.apache.spark.streaming.kafka010.CanCommitOffsets
at com.cts.ignite.inventory.core.ParallelStreamJob$$anonfun$main$1$$anonfun$apply$1.apply(ParallelStreamJob.scala:48)
at com.cts.ignite.inventory.core.ParallelStreamJob$$anonfun$main$1$$anonfun$apply$1.apply(ParallelStreamJob.scala:39)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.a
You can commit the offsets like this:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
In your case kafkaStream is a Seq of streams, so change your commit line.
Reference: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Change the kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) line to rdd.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) - in your code the outer variable named rdd is actually the individual direct stream, which is what implements CanCommitOffsets; the Seq itself does not, hence the ClassCastException on Vector.
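For clarity, an untested sketch of the loop from the question with the outer variable renamed, so it is obvious that commitAsync runs against the DStream (which implements CanCommitOffsets) while the offset ranges come from the batch RDD (which implements HasOffsetRanges):
kafkaStream.foreach { stream =>
  stream.foreachRDD { batch =>
    // the Kafka RDD of each micro-batch carries the offset ranges
    val offsetRanges = batch.asInstanceOf[HasOffsetRanges].offsetRanges
    batch.foreach(record => println(record.value()))
    // commit against the direct stream itself, not the Seq and not the RDD
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
}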

Kafka - How to read more than 10 rows

I made a connector that reads from a database with JDBC, and I am consuming the data from a Spark application. The app reads the database data well, BUT it reads only the first 10 rows and seems to ignore the rest of them. How can I get the rest, so that I can compute with all the data?
Here is my Spark code:
val brokers = "http://127.0.0.1:9092"
val topics = List("postgres-accounts2")

val sparkConf = new SparkConf().setAppName("KafkaWordCount")
//sparkConf.setMaster("spark://sda1:7077,sda2:7077")
sparkConf.setMaster("local[2]")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Record]))

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")

// Create direct kafka stream with brokers and topics
//val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
  "schema.registry.url" -> "http://127.0.0.1:8081",
  "bootstrap.servers" -> "http://127.0.0.1:9092",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer",
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages = KafkaUtils.createDirectStream[String, Record](
  ssc,
  PreferConsistent,
  Subscribe[String, Record](topics, kafkaParams)
)

val data = messages.map(record => {
  println(record) // print only first 10
  // compute here?
  (record.key, record.value)
})

data.print()

// Start the computation
ssc.start()
ssc.awaitTermination()
I believe the issue lies in the fact that Spark is lazy and will only read the data that is actually used.
By default, print shows the first 10 elements of a stream, so each batch only pulls enough records to display those (and the println inside map only runs for the records that are actually read). Since the code does not contain any other action besides the two print calls, there is no need to read more than 10 rows of data per batch. Try using count or another action over the whole batch to confirm that everything is being consumed.
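A minimal, untested sketch of that suggestion, forcing every batch to be read in full:
data.foreachRDD { rdd =>
  // count() is an action over the whole RDD, so every record of the batch is
  // fetched from Kafka, unlike print(), which only takes the first 10 elements
  println(s"Records in this batch: ${rdd.count()}")
}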