Receiving empty data from Kafka - Spark Streaming - apache-kafka

Why am I getting empty data messages when I read a topic from kafka?
Is it a problem with the Decoder?
*There is no error or exception.
Code:
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Queue Status")
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint("/tmp/")
val kafkaConfig = Map("zookeeper.connect" -> "ip.internal:2181",
"group.id" -> "queue-status")
val kafkaTopics = Map("queue_status" -> 1)
val kafkaStream = KafkaUtils.createStream[String, QueueStatusMessage, StringDecoder, QueueStatusMessageKafkaDeserializer](
ssc,
kafkaConfig,
kafkaTopics,
StorageLevel.MEMORY_AND_DISK)
kafkaStream.window(Minutes(1),Seconds(10)).print()
ssc.start()
ssc.awaitTermination()
}
The Kafka decoder:
class QueueStatusMessageKafkaDeserializer(props: VerifiableProperties = null) extends Decoder[QueueStatusMessage] {
override def fromBytes(bytes: Array[Byte]): QueueStatusMessage = QueueStatusMessage.parseFrom(bytes)
}
The (empty) result:
-------------------------------------------
Time: 1440010266000 ms
-------------------------------------------
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
(null,QueueStatusMessage(,,0,None,None))
Solution:
Just strictly specified the types in the Kafka topic Map:
val kafkaTopics = Map[String, Int]("queue_status" -> 1)
Still don't know the reason for the problem, but the code is working fine now.

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in the text and save result to the Cassandra database.
Producer reads the data from the file and sends it to kafka. Consumer uses spark streaming to read and process the date,and then sends the result of the calculations to the table.
My producer looks like this:
object ProducerPlayground extends App {
val topicName = "test"
private def createProducer: Properties = {
val producerProperties = new Properties()
producerProperties.setProperty(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092"
)
producerProperties.setProperty(
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
classOf[IntegerSerializer].getName
)
producerProperties.setProperty(
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
classOf[StringSerializer].getName
)
producerProperties
}
val producer = new KafkaProducer[Int, String](createProducer)
val source = Source.fromFile("G:\\text.txt", "UTF-8")
val lines = source.getLines()
var key = 0
for (line <- lines) {
producer.send(new ProducerRecord[Int, String](topicName, key, line))
key += 1
}
source.close()
producer.flush()
}
Consumer looks like this:
object BatchLayer {
def main(args: Array[String]) {
val brokers = "localhost:9092"
val topics = "test"
val groupId = "groupId-1"
val sparkConf = new SparkConf()
.setAppName("BatchLayer")
.setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(3))
val sc = ssc.sparkContext
sc.setLogLevel("OFF")
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
)
val stream =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val cass = CassandraConnector(sparkConf)
cass.withSessionDo { session =>
session.execute(
s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
)
session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
session.execute(s"TRUNCATE batch_layer.test")
}
stream
.map(v => v.value())
.flatMap(x => x.split(" "))
.filter(x => !x.contains(Array('\n', '\t')))
.map(x => (x, 1))
.reduceByKey(_ + _)
.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
ssc.start()
ssc.awaitTermination()
}
}
After starting producer, the program stops working with this error. What did I do wrong ?
It makes very little sense to use legacy streaming in 2021st - it's very cumbersome to use, and you also need to track offsets for Kafka, etc. It's better to use Structured Streaming instead - it will track offsets for your through the checkpoints, you will work with high-level Dataset APIs, etc.
In your case code could look as following (didn't test, but it's adopted from this working example):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
.selectExpr("split(value, '\\w+', -1) as words")
.selectExpr("explode(words) as word")
.filter("word != ''")
.groupBy($"word")
.count()
.select($"word", $"count")
// create table ...
val query = wordsCountsDF.writeStream
.outputMode(OutputMode.Update)
.format("org.apache.spark.sql.cassandra")
.option("checkpointLocation", "path_to_checkpoint)
.option("keyspace", "test")
.option("table", "<table_name>")
.start()
query.awaitTermination()
P.S. In your example, most probable error is that you're trying to use .saveToCassandra directly on DStream - it doesn't work this way.

Attempting to Write Tuple to Flink Kafka sink

I am trying to write a streaming application that both reads from and writes to Kafka. I currently have this but I have to toString my tuple class.
object StreamingJob {
def main(args: Array[String]) {
// set up the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")
val consumer = env.addSource(new FlinkKafkaConsumer08[String]("topic", new SimpleStringSchema(), properties))
val counts = consumer.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.map { (_, 1) }
.keyBy(0)
.timeWindow(Time.seconds(5))
.sum(1)
val producer = new FlinkKafkaProducer08[String](
"localhost:9092",
"my-topic",
new SimpleStringSchema())
counts.map(_.toString()).addSink(producer)
env.execute("Window Stream WordCount")
env.execute("Flink Streaming Scala API Skeleton")
}
}
The closest I could get to getting this working was the following but the FlinkKafkaProducer08 refuses to accept the type parameter as part of the constructor.
val producer = new FlinkKafkaProducer08[(String, Int)](
"localhost:9092",
"my-topic",
new TypeSerializerOutputFormat[(String, Int)])
counts.addSink(producer)
I am wondering if there is a way to write the tuples directly to my Kafka sink.
You need a class approximately like this that serializes your tuples:
private class SerSchema extends SerializationSchema[Tuple2[String, Int]] {
override def serialize(tuple2: Tuple2[String, Int]): Array[Byte] = ...
}

Only first message in Kafka stream gets processed

In Spark I create a stream from Kafka with a batch time of 5 seconds. Many messages can come in during that time and I want to process each of them individually, but it seems that with my current logic only the first message of each batch is being processed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)
val messages = stream.map((x$2) => x$2._2)
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
val message = rdd.map(parse)
println(message.collect())
}
}
The parse function simply extracts relevant fields from the Json message into a tuple.
I can drill down into the partitions and process each message individually that way:
messages.foreachRDD { rdd =>
if(!rdd.isEmpty) {
rdd.foreachPartition { partition =>
partition.foreach{msg =>
val message = parse(msg)
println(message)
}
}
}
}
But I'm certain there is a way to stay at the RDD level. What am I doing wrong in the first example?
I'm using spark 2.0.0, scala 2.11.8 and spark streaming kafka 0.8.
Here is the sample Streaming app which converts each message for the batch in to upper case inside for each loop and prints them. Try this sample app and then recheck your application. Hope this helps.
object SparkKafkaStreaming {
def main(args: Array[String]) {
//Broker and topic
val brokers = "localhost:9092"
val topic = "myTopic"
//Create context with 5 second batch interval
val sparkConf = new SparkConf().setAppName("SparkKafkaStreaming").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
//Create direct kafka stream with brokers and topics
val topicsSet = Set[String](topic)
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val msgStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
//Message
val msg = msgStream.map(_._2)
msg.print()
//For each
msg.foreachRDD { rdd =>
if (!rdd.isEmpty) {
println("-----Convert Message to UpperCase-----")
//convert messages to upper case
rdd.map { x => x.toUpperCase() }.collect().foreach(println)
} else {
println("No Message Received")
}
}
//Start the computation
ssc.start()
ssc.awaitTermination()
}
}

Spark Streaming, kafka: java.lang.StackOverflowError

I am getting below error in spark-streaming application, i am using kafka for input stream. When i was doing with socket, it was working fine. But when i changed to kafka it's giving error. Anyone has idea why it's throwing error, do i need to change my batch time and check pointing time?
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.StackOverflowError
My program:
def main(args: Array[String]): Unit = {
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for all streaming functionality
val ssc = new StreamingContext(sc, Seconds(5))
val brokers = args(0)
val topics= args(1)
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
val inputStream = messages.map(_._2)
// val inputStream = ssc.socketTextStream(args(0), args(1).toInt)
ssc.checkpoint(checkpointDirectory)
inputStream.print(1)
val parsedStream = inputStream
.map(line => {
val splitLines = line.split(",")
(splitLines(1), splitLines.slice(2, splitLines.length).map((_.trim.toLong)))
})
import breeze.linalg.{DenseVector => BDV}
import scala.util.Try
val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
(current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
prev.map(_ +: current).orElse(Some(current))
.flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
})
state.checkpoint(Duration(10000))
state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
}
}
Try to delete the checkpoint directory.
I'm not sure but it seems that your streaming context fails to restore from the checkpoints.
anyway, it worked for me.

Writing data to cassandra using spark

I have a spark job written in Scala, in which I am just trying to write one line separated by commas, coming from Kafka producer to Cassandra database. But I couldn't call saveToCassandra.
I saw few examples of wordcount where they are writing map structure to Cassandra table with two columns and it seems working fine. But I have many columns and I found that the data structure needs to parallelized.
Here's is the sample of my code:
object TestPushToCassandra extends SparkStreamingJob {
def validate(ssc: StreamingContext, config: Config): SparkJobValidation = SparkJobValid
def runJob(ssc: StreamingContext, config: Config): Any = {
val bp_conf=BpHooksUtils.getSparkConf()
val brokers=bp_conf.get("bp_kafka_brokers","unknown_default")
val input_topics = config.getString("topics.in").split(",").toSet
val output_topic = config.getString("topic.out")
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, input_topics)
val lines = messages.map(_._2)
val words = lines.flatMap(_.split(","))
val li = words.par
li.saveToCassandra("testspark","table1", SomeColumns("col1","col2","col3"))
li.print()
words.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach{
case x:String=>{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val outMsg=x+" from spark"
val producer = new KafkaProducer[String,String](props)
val message=new ProducerRecord[String, String](output_topic,null,outMsg)
producer.send(message)
}
}
)
)
ssc.start()
ssc.awaitTermination()
}
}
I think it's the syntax of Scala that I am not getting correct.
Thanks in advance.
You need to change your words DStream into something that the Connector can handle.
Like a Tuple
val words = lines
.map(_.split(","))
.map( wordArr => (wordArr(0), wordArr(1), wordArr(2))
or a Case Class
case class YourRow(col1: String, col2: String, col3: String)
val words = lines
.map(_.split(","))
.map( wordArr => YourRow(wordArr(0), wordArr(1), wordArr(2)))
or a CassandraRow
This is because if you place an Array there all by itself it could be an Array in C* you are trying to insert rather than 3 columns.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md