Extract the time stamp from kafka messages in spark streaming? - apache-kafka

Trying to read from kafka source. I want to extract timestamp from message received to do structured spark streaming.
kafka(version 0.10.0.0)
spark streaming(version 2.0.1)

spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "your.server.com:9092")
.option("subscribe", "your-topic")
.load()
.select($"timestamp", $"value")
Field "timestamp" is what you are looking for. Type - java.sql.Timestamp. Make sure that you are connecting to 0.10 Kafka server. There is no timestamp in earlier versions.
Full list of fields described here - http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries

I'd suggest couple things:
Suppose you create a stream via latest Kafka Streaming Api (0.10 Kafka)
E.g. you use dependency: "org.apache.spark" %% "spark-streaming-kafka-0-10" % 2.0.1
Than you create a stream, according to the docs above:
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "broker1:9092,broker2:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[ByteArrayDeserializer],
"group.id" -> "spark-streaming-test",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean))
val sparkConf = new SparkConf()
// suppose you have 60 second window
val ssc = new StreamingContext(sparkConf, Seconds(60))
ssc.checkpoint("checkpoint")
val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent,
Subscribe[String, Array[Byte]](topics, kafkaParams))
Your stream will be an DStream of ConsumerRecord[String,Array[Byte]] and you can get a timestamp and key-value as simple as:
stream.map { record => (record.timestamp(), record.key(), record.value()) }
Hope that helps.

Related

Spark Streaming - Join on multiple kafka stream operation is slow

I have 3 kafka streams having 600k+ records each, spark streaming takes more than 10 mins to process simple joins between streams.
Spark Cluster config:
This is how i'm reading kafka streams to tempviews in spark(scala)
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "KAFKASERVER")
.option("subscribe", TOPIC1)
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest").load()
.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=SCHEMA1).as("data"))
.select($"COL1", $"COL2")
.createOrReplaceTempView("TABLE1")
I join 3 TABLES using spark spark sql
select COL1, COL2 from TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK
Execution of Job:
Am i missing out some configuration on spark that i've to look into?
I find the same problem. And I found join between stream and stream needs more memory as I image. And the problem disappear when I increase the cores per executor.
unfortunately there wasn't any test data nor the result data that you expected to be so I could play with, so I cannot give the exact proper answer.
#Asteroid comment is valid, as we see the number of task for each stage is 1. Normally Kafka stream use receiver to consume the topic; and each receiver only create one tasks. One approach is to use multiple receivers / split partition / Increase your resources (# of core) to increase parallelism.
If this still not working, another way is to use Kafka API to createDirectStream. According to the documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html, this one creates an input stream that directly pulls messages from Kafka Brokers without using any receiver.
I premilinary crafted a sample code for creating direct stream below. You might want to learn about this to customize to you own preference.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "KAFKASERVER",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"startingOffsets" -> "earliest",
"endingOffsets" -> "latest"
)
val topics = Array(TOPIC1)
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val schema = StructType(StructField('data', StringType, True))
val df = spark.createDataFrame([], schema)
val dstream = stream.map(_.value())
dstream.forEachRDD(){rdd:RDD[String], time:Time} => {
val tdf = spark.read.schema(schema).json(rdd)
df = df.union(tdf)
df.createOrReplaceTempView("TABLE1")
}
Some related materials:
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (Scroll down to Kafka Consumer Code portion. The other section is irrelevant)
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark Doc for create direct stream)
Good luck!

If Kafka Consumer fails (Spark Job), how to fetch the last offset committed by Kafka Consumer. (Scala)

Before I give any details, please note, I am NOT asking how to fetch the latest offset from Console using kafka-run-class.sh kafka.tools.ConsumerOffsetChecker.
I am trying to make a kafka consumer (kafka version 0.10) in Spark (2.3.1) using Scala (2.11.8), which will be fault tolerant. By fault tolerant, I mean, if for some reason the kafka consumer dies and restarts, it should resume consuming the messages from the last offset.
For achieving this, I commit the Kafka offset once it has been consumed using the below code,
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
val topic = Array("topic_1")
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, offsets) /*Edit: Added offset*/
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async Commit Here */
}
I have tried many thing to fetch the latest Offset for the given topic, but could not get it to work.
Can anyone help me out with the scala code to achieve this, please?
Edit:
In the above code, I am trying to fetch the last offset by using
val offsets = Map(new org.apache.kafka.common.TopicPartition("kafka_cdc_1", 0) -> 2L) /*Edit: Added code to fetch offset*/
but the offset fetched by the above code is 0, not the latest. Is there anyway to fetch the latest offset?
Found the solution to the above issue. Here it is. Hope it helps someone in need.
Language: Scala, Spark Job
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group_101",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean), /*because messages successfully polled by the consumer may not yet have resulted in a Spark output operation*/
"session.timeout.ms" -> (30000: java.lang.Integer),
"heartbeat.interval.ms" -> (3000: java.lang.Integer)
)
import java.util.Properties
//create a new properties object with Kafaka Parameters as done previously. Note: Both needs to be present. We will use the proprty object just to fetch the last offset
val kafka_props = new Properties()
kafka_props.put("bootstrap.servers", "localhost:9092")
kafka_props.put("key.deserializer",classOf[StringDeserializer])
kafka_props.put("value.deserializer",classOf[StringDeserializer])
kafka_props.put("group.id","group_101")
kafka_props.put("auto.offset.reset","latest")
kafka_props.put("enable.auto.commit",(false: java.lang.Boolean))
kafka_props.put("session.timeout.ms",(30000: java.lang.Integer))
kafka_props.put("heartbeat.interval.ms",(3000: java.lang.Integer))
val topic = Array("topic_1")
/*val offsets = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> 2L) Edit: Added code to fetch offset*/
val topicAndPartition = new org.apache.kafka.common.TopicPartition("topic_1", 0) //Using 0 as the partition because this topic does not have any partitions
val consumer = new KafkaConsumer[String,String](kafka_props) //create a 2nd consumer to fetch last offset
import java.util
consumer.subscribe(util.Arrays.asList("topic_1")) //Subscribe to the 2nd consumer. Without this step, the offsetAndMetadata can't be fetched.
val offsetAndMetadata = consumer.committed(topicAndPartition) //Find last committed offset for the given topicAndPartition
val endOffset = offsetAndMetadata.offset().toLong //fetch the last committed offset from offsetAndMetadata and cast it to Long data type.
val fetch_from_offset = Map(new org.apache.kafka.common.TopicPartition("topic_1", 0) -> endOffset) // create a Map with data type (TopicPartition, Long)
val kstream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams, fetch_from_offset) //Pass the offset Map of datatype (TopicPartition, Long) created eariler
)
kstream.foreachRDD{ rdd =>
val offsetRange = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
if(!rdd.isEmpty()) {
val rawRdd = rdd.map(record =>
(record.key(),record.value())).map(_._2).toDS()
val df = spark.read.schema(tabSchema).json(rawRdd)
df.createOrReplaceTempView("temp_tab")
df.write.insertInto("hive_table")
}
kstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRange) /*Doing Async offset Commit Here */
}

Spark streaming checkpoint

I am reading messages from Kafka using Spark Kafka direct streaming. I want to implement zero message loss and after restarts spark, it has to read the missed messages from Kafka. I am using checkpoint to save all read offset, so that next time spark will start read from stored offset. this is my understanding.
I have used below code. I stopped my spark and pushed few message to Kafka. After restart the spark which is not reading missed messages from Kafka. Spark reads latest messages from kafka. How to read the missed message from Kafka?
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(6000))
ssc.checkpoint("C:/cp")
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test")
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(50))
val msgStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Note: Application logs shows auto.offset.reset to none instead of latest. why ?
WARN KafkaUtils: overriding auto.offset.reset to none for executor
SBT
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
Windows : 7
If you want to read missed out messages, try commit process instead of checkpoint.
Please understand, Spark can't read old messages with property:
"auto.offset.reset" -> "latest"
Try this:
val kafkaParams = Map[String, Object](
//...
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
//...
)
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//Your processing goes here
//Then commit after completing your process.
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Hope this helps.
I would rather suggest not to rely on checkpointing instead you can use an external data store to save your processed Kafka message offset.Please follow the link to get some insight.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/

Kafka - How to read more than 10 rows

I made connector that read from database with jdbc, and consuming it from a Spark application. The app read the database data well, BUT it read only first 10 row and seems to ignore rest of them. How should I get rest, so I can compute with all data.
Here are my spark code:
val brokers = "http://127.0.0.1:9092"
val topics = List("postgres-accounts2")
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
//sparkConf.setMaster("spark://sda1:7077,sda2:7077")
sparkConf.setMaster("local[2]")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[Record]))
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
// Create direct kafka stream with brokers and topics
//val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
"schema.registry.url" -> "http://127.0.0.1:8081",
"bootstrap.servers" -> "http://127.0.0.1:9092",
"key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer",
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages = KafkaUtils.createDirectStream[String, Record](
ssc,
PreferConsistent,
Subscribe[String, Record](topics, kafkaParams)
)
val data = messages.map(record => {
println( record) // print only first 10
// compute here?
(record.key, record.value)
})
data.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
I believe the issue lies in that Spark is lazy and will only read the data that is actually used.
By default, print will show the first 10 elements in a stream. Since the code does not contain any other actions in addition to the two print there is no need to read more than 10 rows of data. Try using count or another action to confirm that it is working.

Spark streaming for Kafka: How to get the topic name from Kafka consumer DStream?

I have set up the Spark-Kafka Consumer in Scala that receives messages from multiple topics:
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> properties.getProperty("group_id"),
"auto.offset.reset" -> properties.getProperty("offset_reset")
)
// Kafka integration with receiver
val msgStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
ssc, kafkaParams, Map(properties.getProperty("topic1") -> 1,
properties.getProperty("topic2") -> 2,
properties.getProperty("topic3") -> 3),
StorageLevel.MEMORY_ONLY_SER).map(_._2)
I need to develop corresponding action code for messages (which will be in JSON format) from each topic.
I referred to the following question, but the answer in it didn't help me:
get topic from Kafka message in spark
So, is there any method on the received DStream that can be used to fetch topic name along with the message to determine what action should take place?
Any help on this would be greatly appreciated. Thank you.
See the code below.
You can get topic name and message by foreachRDD, map operation on DStream.
msgStream.foreachRDD(rdd => {
val pairRdd = rdd.map(i => (i.topic(), i.value()))
})
The code below is an example source of createDirectStream that I am using.
val ssc = new StreamingContext(configLoader.sparkConfig, Seconds(conf.getInt(Conf.KAFKA_PULL_INTERVAL)))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString(Conf.KAFKA_BOOTSTRAP_SERVERS),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> conf.getString(Conf.KAFKA_CONSUMER_GID),
"auto.offset.reset" -> conf.getString(Conf.KAFKA_AUTO_OFFSET_RESET),
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics: Array[String] = conf.getString(Conf.KAFKA_TOPICS).split(",")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)