No messages received when using foreachPartition spark streaming - apache-kafka

I am pulling from Kafka using Spark Streaming. When I use foreachPartition on my RDD, I never receive any messages. If I read the messages from the RDD using foreach, it works fine. However, I need to use foreachPartition so that I can open a socket connection on each executor.
This is the code that connects to Spark and creates the stream:
val kafkaParams = Map(
  "zookeeper.connect" -> zooKeepers,
  "group.id" -> "metric-group",
  "zookeeper.connection.timeout.ms" -> "5000")
val inputTopic = "threatflow"

val conf = new SparkConf().setAppName(applicationTitle).set("spark.eventLog.overwrite", "true")
val ssc = new StreamingContext(conf, Seconds(5))

val streams = (1 to numberOfStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
}
val kafkaStream = ssc.union(streams)

kafkaStream.foreachRDD { (rdd, time) =>
  calcVictimsProcess(process, rdd, time.milliseconds)
}

ssc.start()
ssc.awaitTermination()
Here is my code that attempts to process the messages using foreachPartition instead of foreach
val threats = rdd.map(message => gson.fromJson(message._2.substring(1, message._2.length()), classOf[ThreatflowMessage]))

threats.flatMap(mapSrcVictim).reduceByKey((a, b) => a + b).foreachPartition { partition =>
  val socket = new Socket(InetAddress.getByName("localhost"), 4242)
  val writer = new BufferedOutputStream(socket.getOutputStream)
  partition.foreach { value =>
    val parts = value._1.split("-")
    val put = "put %s %d %d type=%s address=%s unique=%s\n".format("metric", bucket, value._2, parts(0), parts(1), unique)
    writer.write(put.getBytes) // send the formatted line over the socket
    Thread.sleep(10000)
  }
  writer.flush()
  socket.close()
}
Simply switching this to foreach will, as I said, work; however, that won't do because I need the sockets to be created per executor.
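For reference, the usual connection-per-partition pattern looks roughly like the following. This is a minimal sketch, not the asker's full code: the host, port, and output line are placeholders, and counts stands for the reduced (key, count) RDD from the snippet above.

import java.io.BufferedOutputStream
import java.net.{InetAddress, Socket}

// Minimal sketch: one socket per partition, write every record, then close.
counts.foreachPartition { partition =>
  val socket = new Socket(InetAddress.getByName("localhost"), 4242)
  val writer = new BufferedOutputStream(socket.getOutputStream)
  try {
    partition.foreach { case (key, count) =>
      writer.write(s"put metric $key $count\n".getBytes("UTF-8"))
    }
    writer.flush()
  } finally {
    socket.close()
  }
}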

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to a Cassandra database.
The producer reads the data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then writes the result of the calculation to a table.
My producer looks like this:
object ProducerPlayground extends App {

  val topicName = "test"

  private def createProducer: Properties = {
    val producerProperties = new Properties()
    producerProperties.setProperty(
      ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
      "localhost:9092"
    )
    producerProperties.setProperty(
      ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      classOf[IntegerSerializer].getName
    )
    producerProperties.setProperty(
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      classOf[StringSerializer].getName
    )
    producerProperties
  }

  val producer = new KafkaProducer[Int, String](createProducer)
  val source = Source.fromFile("G:\\text.txt", "UTF-8")
  val lines = source.getLines()

  var key = 0
  for (line <- lines) {
    producer.send(new ProducerRecord[Int, String](topicName, key, line))
    key += 1
  }

  source.close()
  producer.flush()
}
The consumer looks like this:
object BatchLayer {
  def main(args: Array[String]) {
    val brokers = "localhost:9092"
    val topics = "test"
    val groupId = "groupId-1"

    val sparkConf = new SparkConf()
      .setAppName("BatchLayer")
      .setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val sc = ssc.sparkContext
    sc.setLogLevel("OFF")

    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
    )

    val stream =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
      )

    val cass = CassandraConnector(sparkConf)
    cass.withSessionDo { session =>
      session.execute(
        s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
      )
      session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
      session.execute(s"TRUNCATE batch_layer.test")
    }

    stream
      .map(v => v.value())
      .flatMap(x => x.split(" "))
      .filter(x => !x.contains(Array('\n', '\t')))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}
After starting the producer, the program stops working with this error. What did I do wrong?
It makes very little sense to use legacy Spark Streaming (DStreams) in 2021 - it's very cumbersome to use, and you also need to track Kafka offsets yourself, etc. It's better to use Structured Streaming instead - it will track offsets for you through checkpoints, you will work with the high-level Dataset API, etc.
In your case, the code could look like the following (not tested, but adapted from this working example):
val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
  .selectExpr("split(value, '\\W+', -1) as words")
  .selectExpr("explode(words) as word")
  .filter("word != ''")
  .groupBy($"word")
  .count()
  .select($"word", $"count")

// create table ...

val query = wordsCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "path_to_checkpoint")
  .option("keyspace", "test")
  .option("table", "<table_name>")
  .start()

query.awaitTermination()
P.S. In your example, the most probable error is that you're trying to call .saveToCassandra directly on a DStream - it doesn't work that way.
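If you do stay on DStreams, one pattern that does work is to save each micro-batch RDD inside foreachRDD. A minimal sketch, assuming the spark-cassandra-connector is on the classpath and the batch_layer.test table from the question exists:

import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

stream
  .map(v => v.value())
  .flatMap(_.split(" "))
  .filter(_.nonEmpty) // drop empty tokens; Cassandra rejects empty partition keys
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .foreachRDD { rdd =>
    rdd.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
  }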

How to turn this simple Spark Streaming code into a Multi threaded one?

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I have a separate consumer execution per partition whilst streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    .setMaster("local[*]")
    .set("spark.executor.memory", "1g")
  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val topicsSet = topics.split(",")
  val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
    ConsumerConfig.GROUP_ID_CONFIG -> group_name,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    "auto.offset.reset" -> "earliest")

  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  val lines = messages.map(_.value)
  val words = lines.flatMap(_.split(" "))
  val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
  wordCounts.print()

  ssc.start()
  ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, setting local[*] additionally gives Spark one worker thread per CPU core, so at least one Kafka partition can be consumed in parallel by each thread.
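If you want to see how much parallelism you are actually getting, one quick check (a sketch, using the messages stream from the code above) is to print the number of RDD partitions per batch; with the direct stream this matches the number of Kafka partitions being consumed:

messages.foreachRDD { rdd =>
  // one RDD partition per Kafka partition; each is processed by its own task/thread
  println(s"Kafka/RDD partitions in this batch: ${rdd.getNumPartitions}")
}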

Inconsistent fetch by spark kafka consumer

I have written code to fetch records from Kafka into Spark, and I have come across some strange behaviour: it is consuming at an inconsistent rate.
val conf = new SparkConf()
  .setAppName("Test Data")
  .set("spark.cassandra.connection.host", "192.168.0.40")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "3")
  .set("spark.cores.max", "12")
  .set("spark.driver.cores", "4")
  .set("spark.ui.port", "4040")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "30")
  .set("spark.local.dir", "//tmp//")
  .set("spark.sql.warehouse.dir", "/tmp/hive/")
  .set("hive.exec.scratchdir", "/tmp/hive2")

val spark = SparkSession
  .builder
  .appName("Test Data")
  .config(conf)
  .getOrCreate()
import spark.implicits._

val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))

val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "192.168.0.40:2181",
  "group.id" -> "=groups",
  "auto.offset.reset" -> "smallest")

val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)

kafkaStream.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    try {
      println("Count of rows " + rdd.count())
    } catch {
      case e: Exception => e.printStackTrace
    }
  } else {
    println("blank rdd")
  }
})
Initially I produced 10 million records into Kafka. The producer was then stopped and the Spark consumer application was started. Checking the Spark UI, I initially received 700,000-900,000 records per batch (every 10 seconds) per stream, but afterwards started getting only 4-6K records per batch. I want to understand why the fetch count fell so badly even though the data is present in Kafka; rather than 4K per batch, I would be happy for the consumer to fetch much larger batches directly. What can be done, and how?
Thanks,

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + (x.split(",")(2)), 1)).reduceByKey(_ + _)
})
dStreamDst.print()
on which I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use transform or the print() method. Use foreachRDD; it is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents that snapshot of data in time - in your case, the 10-second window. Your code is failing because at a particular time window there was no data, and the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation:
lines.transform(rdd => {
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
})
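For completeness, the foreachRDD variant mentioned in the question's update could look roughly like this. It is a sketch only, reusing the same reduce logic and the slice value from the question; here the per-batch result is just printed:

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // same minimum-timestamp and bucketing logic as the question's transform
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
  }
}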

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map((x$2) => x$2._2)

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val message = rdd.map(parse)
    println(message.collect())
  }
}
The parse function simply extracts relevant fields from the Json message into a tuple.
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    rdd.foreachPartition { partition =>
      partition.foreach { msg =>
        val message = parse(msg)
        println(message)
      }
    }
  }
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using spark 2.0.0, scala 2.11.8 and spark streaming kafka 0.8.
Here is a sample streaming app which converts each message in the batch to upper case inside a foreachRDD loop and prints it. Try this sample app and then recheck your application. Hope this helps.
object SparkKafkaStreaming {
  def main(args: Array[String]) {
    // Broker and topic
    val brokers = "localhost:9092"
    val topic = "myTopic"

    // Create context with 5 second batch interval
    val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create direct kafka stream with brokers and topics
    val topicsSet = Set[String](topic)
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    // Message
    val msg = msgStream.map(_._2)
    msg.print()

    // For each
    msg.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        println("-----Convert Message to UpperCase-----")
        // convert messages to upper case
        rdd.map { x => x.toUpperCase() }.collect().foreach(println)
      } else {
        println("No Message Received")
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}