Apache Kafka: How to receive latest message from Kafka? - scala

I am consuming and processing messages in the Kafka consumer application using Spark in Scala. Sometimes it takes little more time than usual to process messages from Kafka message queue. At that time I need to consume latest message, ignoring the earlier ones which have been published by the producer and yet to be consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {
def main(args: scala.Array[String]): Unit = {
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(1))
val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> group_id,
"auto.offset.reset" -> properties.getProperty("offset_reset"),
"zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))
val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
ssc,
kafkaParams,
Map("moved_object" -> 1),
StorageLevel.MEMORY_ONLY_SER
).map(_._2)
msgStream.foreachRDD { x =>
x.foreach {
msg => println("Message: "+msg)
processMessage(msg)
}
}
ssc.start()
ssc.awaitTermination()
}
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set any property in Kafka configuration to achieve the same?
Any help on this would be greatly appreciated. Thank you

Kafka consumer api include method
void seekToEnd(Collection<TopicPartition> partitions)
So, you can get assigned partitions from consumer and seek for all of them to the end. There is similar method to seekToBeginning.

You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition and provide its end offset from above call minus 1 (assuming it's greater than 0).

You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.

Yes, you can set staringOffset to latest to consume latest messages.
val spark = SparkSession
.builder
.appName("kafka-reading")
.getOrCreate()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "latest")
.option("subscribe", topicName)
.load()

Related

Spark batch reading from Kafka & using Kafka to keep track of offsets

I understand that using Kafka's own offset tracking instead of other methods (like checkpointing) is problematic for streaming jobs.
However I just want to run a Spark batch job every day, reading all messages from the last offset to the most recent and do some ETL with it.
In theory I want to read this data like so:
val dataframe = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:6001")
.option("subscribe", "topic-in")
.option("includeHeaders", "true")
.option("kafka.group.id", s"consumer-group-for-this-job")
.load()
And have Spark commit the offsets back to Kafka based on the group.id
Unfortunately Spark never commits these back, so I went creative and added in the end of my etl job, this code to manually update the offsets for the consumer in Kafka:
val offsets: Map[TopicPartition, OffsetAndMetadata] = dataFrame
.select('topic, 'partition, 'offset)
.groupBy("topic", "partition")
.agg(max('offset))
.as[(String, Int, Long)]
.collect()
.map {
case (topic, partition, maxOffset) => new TopicPartition(topic, partition) -> new OffsetAndMetadata(maxOffset)
}
.toMap
val props = new Properties()
props.put("group.id", "consumer-group-for-this-job")
props.put("bootstrap.servers", "localhost:6001")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("enable.auto.commit", "false")
val kafkaConsumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
kafkaConsumer.commitSync(offsets.asJava)
Which technically works, but still next time reading based on this group.id Spark will still start from the beginning.
Do I have to bite the bullet and keep track of the offsets somewhere, or is there something I'm overlooking?
BTW I'm testing this with EmbeddedKafka
"However I just want to run a Spark batch job every day, reading all messages from the last offset to the most recent and do some ETL with it."
The Trigger.Once is exactly made for this kind of requirement.
There is a nice blog from Databricks that explains why "Streaming and RunOnce is Better than Batch".
Most importantly:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Although your approach is working technically, I would really recommend to have Spark take care of the offset management.
It probably does not work with EmbeddedKafka as this is running only in memory and not remembering that you have committed some offsets between runs of your test code. Therefore, it starts reading again and again from earliest offset.
I managed to resolve it by leaving the spark.read as is, ignoring the group.id etc. But surrounding it with my own KafkaConsumer logic.
protected val kafkaConsumer: String => KafkaConsumer[Array[Byte], Array[Byte]] =
groupId => {
val props = new Properties()
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.bootstrapServers)
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
new KafkaConsumer[Array[Byte], Array[Byte]](props)
}
protected def getPartitions(kafkaConsumer: KafkaConsumer[_, _], topic: String): List[TopicPartition] = {
import scala.collection.JavaConverters._
kafkaConsumer
.partitionsFor(topic)
.asScala
.map(p => new TopicPartition(topic, p.partition()))
.toList
}
protected def getPartitionOffsets(kafkaConsumer: KafkaConsumer[_, _], topic: String, partitions: List[TopicPartition]): Map[String, Map[String, Long]] = {
Map(
topic -> partitions
.map(p => p.partition().toString -> kafkaConsumer.position(p))
.map {
case (partition, offset) if offset == 0L => partition -> -2L
case mapping => mapping
}
.toMap
)
}
def getStartingOffsetsString(kafkaConsumer: KafkaConsumer[_, _], topic: String)(implicit logger: Logger): String = {
Try {
import scala.collection.JavaConverters._
val partitions: List[TopicPartition] = getPartitions(kafkaConsumer, topic)
kafkaConsumer.assign(partitions.asJava)
val startOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(kafkaConsumer, topic, partitions)
logger.debug(s"Starting offsets for $topic: ${startOffsets(topic).filterNot(_._2 == -2L)}")
implicit val formats = org.json4s.DefaultFormats
Serialization.write(startOffsets)
} match {
case Success(jsonOffsets) => jsonOffsets
case Failure(e) =>
logger.error(s"Failed to retrieve starting offsets for $topic: ${e.getMessage}")
"earliest"
}
}
// MAIN CODE
val groupId = consumerGroupId(name)
val currentKafkaConsumer = kafkaConsumer(groupId)
val topic = config.topic.getOrElse(name)
val startingOffsets = getStartingOffsetsString(currentKafkaConsumer, topic)
val dataFrame = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", config.bootstrapServers)
.option("subscribe", topic)
.option("includeHeaders", "true")
.option("startingOffsets", startingOffsets)
.option("enable.auto.commit", "false")
.load()
Try {
import scala.collection.JavaConverters._
val partitions: List[TopicPartition] = getPartitions(kafkaConsumer, topic)
val numRecords = dataFrame.cache().count() // actually read data from kafka
kafkaConsumer.seekToEnd(partitions.asJava) // assume the read has head everything
val endOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(kafkaConsumer, topic, partitions)
logger.debug(s"Loaded $numRecords records")
logger.debug(s"Ending offsets for $topic: ${endOffsets(topic).filterNot(_._2 == -2L)}")
kafkaConsumer.commitSync()
kafkaConsumer.close()
} match {
case Success(_) => ()
case Failure(e) =>
logger.error(s"Failed to set offsets for $topic: ${e.getMessage}")
}

Spark Streaming - Join on multiple kafka stream operation is slow

I have 3 kafka streams having 600k+ records each, spark streaming takes more than 10 mins to process simple joins between streams.
Spark Cluster config:
This is how i'm reading kafka streams to tempviews in spark(scala)
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "KAFKASERVER")
.option("subscribe", TOPIC1)
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest").load()
.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=SCHEMA1).as("data"))
.select($"COL1", $"COL2")
.createOrReplaceTempView("TABLE1")
I join 3 TABLES using spark spark sql
select COL1, COL2 from TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK
Execution of Job:
Am i missing out some configuration on spark that i've to look into?
I find the same problem. And I found join between stream and stream needs more memory as I image. And the problem disappear when I increase the cores per executor.
unfortunately there wasn't any test data nor the result data that you expected to be so I could play with, so I cannot give the exact proper answer.
#Asteroid comment is valid, as we see the number of task for each stage is 1. Normally Kafka stream use receiver to consume the topic; and each receiver only create one tasks. One approach is to use multiple receivers / split partition / Increase your resources (# of core) to increase parallelism.
If this still not working, another way is to use Kafka API to createDirectStream. According to the documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html, this one creates an input stream that directly pulls messages from Kafka Brokers without using any receiver.
I premilinary crafted a sample code for creating direct stream below. You might want to learn about this to customize to you own preference.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "KAFKASERVER",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"startingOffsets" -> "earliest",
"endingOffsets" -> "latest"
)
val topics = Array(TOPIC1)
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val schema = StructType(StructField('data', StringType, True))
val df = spark.createDataFrame([], schema)
val dstream = stream.map(_.value())
dstream.forEachRDD(){rdd:RDD[String], time:Time} => {
val tdf = spark.read.schema(schema).json(rdd)
df = df.union(tdf)
df.createOrReplaceTempView("TABLE1")
}
Some related materials:
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (Scroll down to Kafka Consumer Code portion. The other section is irrelevant)
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark Doc for create direct stream)
Good luck!

kafka and Spark: Get first offset of a topic via API

I am playing with Spark Streaming and Kafka (with the Scala API), and would like to read message from a set of Kafka topics with Spark Streaming.
The following method:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "smallest")
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
reads from Kafka to the latest available offset, but doesn't give me the metadata that I need (since I am reading from a set of topics, I need for every message I read that topic) but this other method KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple2[String, String]](ssc, kafkaParams, currentOffsets, messageHandler) wants explicitly an offset that I don't have.
I know that there is this shell command that gives you the last offset.
kafka-run-class.sh kafka.tools.GetOffsetShell
--broker-list <broker>: <port>
--topic <topic-name> --time -1 --offsets 1
and KafkaCluster.scala is an API that is for developers that used to be public and gives you exactly what I would like.
Hint?
You can use the code from GetOffsetShell.scala kafka API documentation
val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
val topicAndPartition = TopicAndPartition(topic, partitionId)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, nOffsets)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
Or you can create new consumer with unique groupId and use it for getting first offset
val consumer=new KafkaConsumer[String, String](createConsumerConfig(config.brokerList))
consumer.partitionsFor(config.topic).foreach(pi => {
val topicPartition = new TopicPartition(pi.topic(), pi.partition())
consumer.assign(List(topicPartition))
consumer.seekToBeginning()
val firstOffset = consumer.position(topicPartition)
...

Spark-Stream Differentiating Kafka Topics

If I feed my Spark-Streaming Application more than one topic like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2)
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application how can I figure out from my stream what the difference between which topic it is pulling from? Is there a way to do? If I do something like
val lines = stream.print()
I am getting nothing of differentiation. Is the only way to do it to make the Kafka Message Key a denoting factor?
Yes, you can use MessageAndMetadata version of createDirectStream which allows you to access message metadata.
You can find example implementation here .

How to write spark streaming DF to Kafka topic

I am using Spark Streaming to process data between two Kafka queues but I can not seem to find a good way to write on Kafka from Spark. I have tried this:
input.foreachRDD(rdd =>
rdd.foreachPartition(partition =>
partition.foreach {
case x: String => {
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
println(x)
val producer = new KafkaProducer[String, String](props)
val message = new ProducerRecord[String, String]("output", null, x)
producer.send(message)
}
}
)
)
and it works as intended but instancing a new KafkaProducer for every message is clearly unfeasible in a real context and I'm trying to work around it.
I would like to keep a reference to a single instance for every process and access it when I need to send a message. How can I write to Kafka from Spark Streaming?
Yes, unfortunately Spark (1.x, 2.x) doesn't make it straight-forward how to write to Kafka in an efficient manner.
I'd suggest the following approach:
Use (and re-use) one KafkaProducer instance per executor process/JVM.
Here's the high-level setup for this approach:
First, you must "wrap" Kafka's KafkaProducer because, as you mentioned, it is not serializable. Wrapping it allows you to "ship" it to the executors. The key idea here is to use a lazy val so that you delay instantiating the producer until its first use, which is effectively a workaround so that you don't need to worry about KafkaProducer not being serializable.
You "ship" the wrapped producer to each executor by using a broadcast variable.
Within your actual processing logic, you access the wrapped producer through the broadcast variable, and use it to write processing results back to Kafka.
The code snippets below work with Spark Streaming as of Spark 2.0.
Step 1: Wrapping KafkaProducer
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
class MySparkKafkaProducer[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
/* This is the key idea that allows us to work around running into
NotSerializableExceptions. */
lazy val producer = createProducer()
def send(topic: String, key: K, value: V): Future[RecordMetadata] =
producer.send(new ProducerRecord[K, V](topic, key, value))
def send(topic: String, value: V): Future[RecordMetadata] =
producer.send(new ProducerRecord[K, V](topic, value))
}
object MySparkKafkaProducer {
import scala.collection.JavaConversions._
def apply[K, V](config: Map[String, Object]): MySparkKafkaProducer[K, V] = {
val createProducerFunc = () => {
val producer = new KafkaProducer[K, V](config)
sys.addShutdownHook {
// Ensure that, on executor JVM shutdown, the Kafka producer sends
// any buffered messages to Kafka before shutting down.
producer.close()
}
producer
}
new MySparkKafkaProducer(createProducerFunc)
}
def apply[K, V](config: java.util.Properties): MySparkKafkaProducer[K, V] = apply(config.toMap)
}
Step 2: Use a broadcast variable to give each executor its own wrapped KafkaProducer instance
import org.apache.kafka.clients.producer.ProducerConfig
val ssc: StreamingContext = {
val sparkConf = new SparkConf().setAppName("spark-streaming-kafka-example").setMaster("local[2]")
new StreamingContext(sparkConf, Seconds(1))
}
ssc.checkpoint("checkpoint-directory")
val kafkaProducer: Broadcast[MySparkKafkaProducer[Array[Byte], String]] = {
val kafkaProducerConfig = {
val p = new Properties()
p.setProperty("bootstrap.servers", "broker1:9092")
p.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
p.setProperty("value.serializer", classOf[StringSerializer].getName)
p
}
ssc.sparkContext.broadcast(MySparkKafkaProducer[Array[Byte], String](kafkaProducerConfig))
}
Step 3: Write from Spark Streaming to Kafka, re-using the same wrapped KafkaProducer instance (for each executor)
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.RecordMetadata
val stream: DStream[String] = ???
stream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val metadata: Stream[Future[RecordMetadata]] = partitionOfRecords.map { record =>
kafkaProducer.value.send("my-output-topic", record)
}.toStream
metadata.foreach { metadata => metadata.get() }
}
}
Hope this helps.
My first advice would be to try to create a new instance in foreachPartition and measure if that is fast enough for your needs (instantiating heavy objects in foreachPartition is what the official documentation suggests).
Another option is to use an object pool as illustrated in this example:
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala
I however found it hard to implement when using checkpointing.
Another version that is working well for me is a factory as described in the following blog post, you just have to check if it provides enough parallelism for your needs (check the comments section):
http://allegro.tech/2015/08/spark-kafka-integration.html
With Spark >= 2.2
Both read and write operations are possible on Kafka using Structured Streaming API
Build stream from Kafka topic
// Subscribe to a topic and read messages from the earliest to latest offsets
val ds= spark
.readStream // use `read` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("subscribe", "source-topic1")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
Read the key and value and apply the schema for both, for simplicity we are making converting both of them to String type.
val dsStruc = ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
Since dsStruc have the schema, it accepts all SQL kind operations like filter, agg, select ..etc on it.
Write stream to Kafka topic
dsStruc
.writeStream // use `write` for batch, like DataFrame
.format("kafka")
.option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
.option("topic", "target-topic1")
.start()
More configuration for Kafka integration to read or write
Key artifacts to add in the application
"org.apache.spark" % "spark-core_2.11" % 2.2.0,
"org.apache.spark" % "spark-streaming_2.11" % 2.2.0,
"org.apache.spark" % "spark-sql-kafka-0-10_2.11" % 2.2.0,
There is a Streaming Kafka Writer maintained by Cloudera (actually spun off from a Spark JIRA [1]). It basically creates a producer per partition, which amortizes the time spent to create 'heavy' objects over a (hopefully large) collection of elements.
The Writer can be found here: https://github.com/cloudera/spark-kafka-writer
I was having the same issue and found this post.
The author solves the problem by creating 1 producer per executor. Instead of sending the producer itself, he sends only a “recipe” how to create a producer in an executor by broadcasting it.
val kafkaSink = sparkContext.broadcast(KafkaSink(conf))
He uses a wrapper that lazily creates the producer:
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
lazy val producer = createProducer()
def send(topic: String, value: String): Unit = producer.send(new ProducerRecord(topic, value))
}
object KafkaSink {
def apply(config: Map[String, Object]): KafkaSink = {
val f = () => {
val producer = new KafkaProducer[String, String](config)
sys.addShutdownHook {
producer.close()
}
producer
}
new KafkaSink(f)
}
}
The wrapper is serializable because the Kafka producer is initialized just before first use on an executor. The driver keeps the reference to the wrapper and the wrapper sends the messages using each executor's producer:
dstream.foreachRDD { rdd =>
rdd.foreach { message =>
kafkaSink.value.send("topicName", message)
}
}
Why is it infeasible? Fundamentally each partition of each RDD is going to run independently (and may well run on a different cluster node), so you have to redo the connection (and any synchronization) at the start of each partition's task. If the overhead of that is too high then you should increase the batch size in your StreamingContext until it becomes acceptable (obv. there's a latency cost to doing this).
(If you're not handling thousands of messages in each partition, are you sure you need spark-streaming at all? Would you do better with a standalone application?)
This might be what you want to do. You basically create one producer for each partition of records.
input.foreachRDD(rdd =>
rdd.foreachPartition(
partitionOfRecords =>
{
val props = new HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String,String](props)
partitionOfRecords.foreach
{
case x:String=>{
println(x)
val message=new ProducerRecord[String, String]("output",null,x)
producer.send(message)
}
}
})
)
Hope that helps
With Spark < 2.2
Since there is no direct way of writing the messages to Kafka from Spark Streaming
Create a KafkaSinkWritter
import java.util.Properties
import org.apache.kafka.clients.producer._
import org.apache.spark.sql.ForeachWriter
class KafkaSink(topic:String, servers:String) extends ForeachWriter[(String, String)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (String, String)): Unit = {
producer.send(new ProducerRecord(topic, value._1 + ":" + value._2))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
Write messages using SinkWriter
val topic = "<topic2>"
val brokers = "<server:ip>"
val writer = new KafkaSink(topic, brokers)
val query =
streamingSelectDF
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(ProcessingTime("25 seconds"))
.start()
Reference link