Spark Kafka producer introducing duplicate records during Kafka ingestion - Scala

I have written a Spark Kafka producer that pulls messages from Hive and pushes them into Kafka. Most of the records (messages) get duplicated when we ingest into Kafka, even though there are no duplicates before pushing into Kafka. I have added the configurations related to exactly-once semantics, making the Kafka producer idempotent.
Below is the code snippet I am using for the Kafka producer:
import java.util.{Properties, UUID}
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}
import org.apache.log4j.LogManager
import org.apache.spark.sql.DataFrame
object KafkaWriter {

  lazy val log = LogManager.getLogger("KafkaWriter")

  private def writeKafkaWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int,
                                           kafkaBootstrapServer: String): Unit = {
    log.info("Inside Method KafkaWriter::writeKafkaWithoutRepartition no of partitions =" + noOfPartitions)
    df.foreachPartition(
      iter => {
        val properties = getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer)
        // a unique transactional.id per partition/attempt
        properties.setProperty(ProducerConfig.TRANSACTIONAL_ID_CONFIG, UUID.randomUUID().toString + "-" + System.currentTimeMillis().toString)
        log.info("Inside Method writeKafkaWithoutRepartition:: inside writeKafka :: kafka properties ::" + properties)
        val kafkaProducer = new KafkaProducer[String, String](properties)
        try {
          log.info("kafka producer property enable.idempotence ::" + properties.getProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG))
          kafkaProducer.initTransactions()
          kafkaProducer.beginTransaction()
          log.info("Inside Method writeKafkaWithoutRepartition:: inside each partition kafka transactions started")
          iter.foreach(row => {
            log.info("Inside Method writeKafkaWithoutRepartition:: inside each iterator record kafka transactions started")
            kafkaProducer.send(new ProducerRecord(topic, row.getAs[String]("value")))
          })
          kafkaProducer.commitTransaction()
          log.info("Inside Method writeKafkaWithoutRepartition:: kafka transactions completed")
        } catch {
          case e @ (_: ProducerFencedException | _: OutOfOrderSequenceException | _: AuthorizationException) =>
            // We can't recover from these exceptions, so our only option is to close the producer and exit.
            log.error("Exception occurred while sending records to kafka ::" + e.getMessage)
            kafkaProducer.close()
          case e: KafkaException =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occurred while sending records to kafka ::" + e.getMessage)
            kafkaProducer.abortTransaction()
          case ex: Exception =>
            // For all other exceptions, just abort the transaction and try again.
            log.error("Exception occurred while sending records to kafka ::" + ex.getMessage)
            kafkaProducer.abortTransaction()
        } finally {
          kafkaProducer.close()
        }
      })
  }

  def writeWithoutRepartition(df: DataFrame, topic: String, noOfPartitions: Int, kafkaBootstrapServer: String): Unit = {
    val repartitionedDF = df.selectExpr("to_json(struct(*)) AS value")
    log.info("Inside KafkaWriter::writeWithoutRepartition ")
    writeKafkaWithoutRepartition(repartitionedDF, topic, noOfPartitions, kafkaBootstrapServer)
  }

  def getKafkaProducerPropertiesForBulkLoad(kafkaBootstrapServer: String): Properties = {
    val properties = new Properties
    properties.setProperty(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServer)
    properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    // properties.setProperty(CommonClientConfigs.REQUEST_TIMEOUT_MS_CONFIG, "400000")
    // properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, "300000")
    properties.put(ProducerConfig.RETRIES_CONFIG, "1000")
    properties.put(ProducerConfig.ACKS_CONFIG, "all")
    properties.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
    properties
  }
}
I set isolation.level=read_committed on the Kafka consumer end.
I also tried setting min.insync.replicas=2 (in my opinion this property might not play an important role here, but I tried it anyway).
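For reference, the consumer side is configured roughly like this (a minimal sketch; the group id, topic, and bootstrap server are placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val consumerProps = new Properties()
consumerProps.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // placeholder
consumerProps.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "bulk-load-verification")  // placeholder group id
// only read messages from committed transactions
consumerProps.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed")

val consumer = new KafkaConsumer[String, String](consumerProps)
consumer.subscribe(Collections.singletonList("my_topic"))  // placeholder topic
val records = consumer.poll(Duration.ofSeconds(5))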
Spark version : 2.3.1
kafka client version : 2.2.1
I am also using transactions while producing messages into Kafka (init, begin, and commit transactions for each message). I am ingesting around 100 million records at a time; I did split the data into smaller chunks, say 100 million divided into 1 million at once, before ingesting into Kafka.
I also tried using Structured Streaming (the built-in Kafka data source), still no luck:
df.selectExpr(s""" '${key}' as key """, "to_json(struct(*)) AS value").write.format("kafka").options(getKafkaProducerProperties(topic)).save()
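Here getKafkaProducerProperties(topic) is a helper that returns the options for the built-in Kafka data source; a sketch of what it presumably contains (the broker address is a placeholder, and producer configs are passed through with the "kafka." prefix):

def getKafkaProducerProperties(topic: String): Map[String, String] = Map(
  "kafka.bootstrap.servers" -> "broker1:9092",  // placeholder
  "topic"                   -> topic,
  // producer-level configs forwarded to the underlying Kafka producer
  "kafka.enable.idempotence" -> "true",
  "kafka.acks"               -> "all",
  "kafka.max.in.flight.requests.per.connection" -> "1"
)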
I am not sure if I am missing any configurations on the Kafka producer, broker, or consumer end.
Not sure if I can include anything else.
Thanks in advance.

Related

One to One instant messaging using Kafka

I'm using Scala and Kafka to create a topic-based pub-sub architecture.
My question is: how can I handle the one-to-one messaging part of my application using Kafka topics?
This is my producer class:
class Producer(topic: String, key: String, brokers: String, message: String) {

  val producer = new KafkaProducer[String, String](configuration)

  private def configuration: Properties = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getCanonicalName)
    props
  }

  def sendMessages(): Unit = {
    val record = new ProducerRecord[String, String](topic, key, message)
    producer.send(record)
    producer.close()
  }
}
And this is my consumer class:
class Consumer(brokers: String, topic: String, groupId: String) {

  val consumer = new KafkaConsumer[String, String](configuration)
  consumer.subscribe(util.Arrays.asList(topic))

  private def configuration: Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getCanonicalName)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    //props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true)
    props
  }

  def receiveMessages(): Unit = {
    while (true) {
      consumer.poll(Duration.ofSeconds(0)).forEach(record => println(s"Received message: $record"))
    }
  }
}
I also have an auth service that takes care of everything related to authenticating via JWT tokens.
I am confused about how to create messages for specific users. I thought about creating a "Messages" class, but I got lost when it comes to how to send these "specific" users their messages and how to partition these messages in Kafka for later usage:
class Message {

  def sendMessage(sender_id: String, receiver_id: String, content: String): Unit = {
    val newMessage = new Producer(brokers = KAFKA_BROKER, key = sender_id + " to " + receiver_id, topic = "topic_1", message = content)
    newMessage.sendMessages()
  }

  def loadMessage(): Unit = {
    //
  }
}
My thought was to specify a custom key for all messages belonging to the same conversation, but I couldn't find the right way to retrieve these messages later on, as my consumer returns everything contained in that topic no matter what the key is. Meaning, all users will eventually get all the messages. I know my modeling seems messy, but I couldn't find the right way to do it; I'm also somewhat confused about the usage of the group_id in the consumer.
Could someone show me the right way to achieve what I'm trying to do here, please?
couldn't find the right way to retrieve these messages later on ... consumer returns everything contained in that topic no matter what the key is
You would need to .assign the Consumer instance to a specific partition, not use .subscribe, which reads all partitions. Or you'd use specific topics for each conversation.
But then you need unique partitions/topics for every conversation that will exist. In a regular chat application where users create/remove rooms randomly, that will not scale for Kafka.
Ultimately, I'd suggest writing your data somewhere other than Kafka that you can actually query and index on a "conversationId" and/or user ids, rather than trying to forward those events directly from Kafka into your "chat" application.
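For the .assign route, a minimal sketch (it reuses the KafkaConsumer instance from the Consumer class above; the topic name and partition number are placeholders):

import java.time.Duration
import java.util.Collections
import org.apache.kafka.common.TopicPartition

// read only the partition that holds this conversation, instead of subscribing to the whole topic
val conversationPartition = new TopicPartition("topic_1", 0)  // partition chosen for the conversation
consumer.assign(Collections.singletonList(conversationPartition))

while (true) {
  consumer.poll(Duration.ofSeconds(1)).forEach(record => println(s"Received message: $record"))
}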

Returning values from kafka consumer

I am working on a Scala application that uses Kafka. I need to consume messages from a Kafka topic. Since I am writing a test case, I need to get the record and do some assertions on it to pass my test case. I am using the following code to consume Kafka messages:
import java.util.{Collections, Properties}
import java.util.regex.Pattern
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object KafkaConsumerSubscribeApp extends App {

  val props: Properties = new Properties()
  props.put("group.id", "test")
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("enable.auto.commit", "true")
  props.put("auto.commit.interval.ms", "1000")

  val consumer = new KafkaConsumer[String, String](props)
  val topics = List("topic_text")

  try {
    consumer.subscribe(topics.asJava)
    while (true) {
      val records = consumer.poll(10)
      for (record <- records.asScala) {
        println("Topic: " + record.topic() +
          ", Key: " + record.key() +
          ", Value: " + record.value() +
          ", Offset: " + record.offset() +
          ", Partition: " + record.partition())
      }
    }
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    consumer.close()
  }
}
There are two problems I am facing with this code. First, IntelliJ warns that the poll method is deprecated. How can I modify this code to fix the deprecation? Second, I want this method to return the message it gets from the Kafka topic; that message is in record.value(). How can I return it? Since while (true) is used, this is an endless loop and it will keep listening for messages from the topic. How can I return record.value() from this method so that I can use the data I got from the topic in other methods?
If you want to return a value, you have to modify the code as follows. To remove the deprecation warning on poll, you just need to pass a Duration. For more info, refer to the Kafka consumer documentation.
def readFromKafka( /* arguments */ ): List[String] = {
  // other code
  consumer.subscribe(util.Collections.singletonList(topic))
  consumer.poll(Duration.ofMillis(5000)).asScala.toList.map(_.value())
}
To keep on reading from Kafka, you have to keep calling the function; for that, you may use a scheduler.
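For example, a minimal sketch using a ScheduledExecutorService (it assumes the readFromKafka method sketched above, with the consumer and topic already set up as in the question):

import java.util.concurrent.{Executors, TimeUnit}

val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    // call the function periodically and hand the values to whatever needs them
    val values = readFromKafka()
    values.foreach(v => println(s"Consumed value: $v"))
  }
}, 0, 5, TimeUnit.SECONDS)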
I hope it will help.

Azure Databricks Concurrent Job -Avoid Consuming the same eventhub messages in all Jobs

Please help us implement partitioning/grouping when receiving Event Hub messages in Azure Databricks concurrent jobs, and the right approach to consume Event Hub messages in a concurrent job.
We created 3 concurrent jobs in Azure Databricks, uploading consumer code written in Scala as jar files. In this case we receive the same messages in all 3 concurrent jobs. To overcome this issue we tried to consume the events by partitioning, but we still receive the same messages in all 3 partitions.
We also tried sending messages based on a partition key and creating consumer groups in Event Hubs, but we still receive the same messages in all the groups. We are not sure how to handle the Event Hub messages in the concurrent jobs.
EventHub Configuration:
Number of partitions is 3 and message retention is 1 (day).
EventHub Producer: Sending messages to Event Hub using .NET (C#) works fine.
EventHub Consumer: Able to receive messages through the Scala program without any issues.
Problem: As described above, we receive the same messages in all 3 concurrent jobs, and again the same messages in all 3 partitions and in all the consumer groups we tried.
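For reference, the consumer-group attempt looked roughly like this (a sketch; setConsumerGroup is part of the azure-eventhubs-spark EventHubsConf builder, and the group names are placeholders):

// each concurrent job reads through its own consumer group, so each gets its own view of the stream
val connectionString = ConnectionStringBuilder("ConnectionString").setEventHubName("eventhub1").build
val ehConfJob1 = EventHubsConf(connectionString).setConsumerGroup("databricks-job-1")
val ehConfJob2 = EventHubsConf(connectionString).setConsumerGroup("databricks-job-2")
val ehConfJob3 = EventHubsConf(connectionString).setConsumerGroup("databricks-job-3")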
Producer C# Code:
string eventHubName = ConfigurationManager.AppSettings["eventHubname"];
string connectionString = ConfigurationManager.AppSettings["eventHubconnectionstring"];
eventHubClient = EventHubClient.CreateFromConnectionString(connectionString, eventHubName);
for (var i = 0; i < 100; i++)
{
var sender = "event hub message 1" + i;
var data = new EventData(Encoding.UTF8.GetBytes(sender));
Console.WriteLine($"Sending message: {sender}");
eventHubClient.SendAsync(data);
}
eventHubClient.CloseAsync();
Console.WriteLine("Press ENTER to exit.");
Console.ReadLine();
Consumer Scala Code:
object ReadEvents {

  val spark = SparkSession.builder()
    .appName("eventhub")
    .getOrCreate()

  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))

  def main(args: Array[String]): Unit = {

    val connectionString = ConnectionStringBuilder("ConnectionString").setEventHubName("eventhub1").build

    val positions = Map(new NameAndPartition("eventhub1", 0) -> EventPosition.fromStartOfStream)
    val position2 = Map(new NameAndPartition("eventhub1", 1) -> EventPosition.fromEnqueuedTime(Instant.now()))
    val position3 = Map(new NameAndPartition("eventhub1", 2) -> EventPosition.fromEnqueuedTime(Instant.now()))

    val ehConf = EventHubsConf(connectionString).setStartingPositions(positions)
    val ehConf2 = EventHubsConf(connectionString).setStartingPositions(position2)
    val ehConf3 = EventHubsConf(connectionString).setStartingPositions(position3)

    val stream = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf)
    println("Before the loop")
    stream.foreachRDD(rdd => {
      rdd.collect().foreach(rec => {
        println(String.format("Message is first stream ===>: %s", new String(rec.getBytes(), Charset.defaultCharset())))
      })
    })

    val stream2 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf2)
    stream2.foreachRDD(rdd2 => {
      rdd2.collect().foreach(rec2 => {
        println(String.format("Message second stream is ===>: %s", new String(rec2.getBytes(), Charset.defaultCharset())))
      })
    })

    val stream3 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf3)
    stream3.foreachRDD(rdd3 => {
      println("Inside 3rd stream foreach loop")
      rdd3.collect().foreach(rec3 => {
        println(String.format("Message is third stream ===>: %s", new String(rec3.getBytes(), Charset.defaultCharset())))
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
Expecting to partition the eventhub messages properly when receiving it on concurrent jobs running using scala program.
The below code helps to iterate through all partitions:
eventHubsStream.foreachRDD { rdd =>
  rdd.foreach { message =>
    if (message != null) {
      callYourMethod(new String(message.getBytes()))
    }
  }
}

Kafka producer hangs on send

The logic is that a streaming job, getting data from a custom source, has to write both to Kafka and to HDFS.
I wrote a (very) basic Kafka producer to do this; however, the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String, val sslCertificatePath: String, val sslCertificatePassword: String) {

  val kafkaProps: Properties = new Properties()
  kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
  kafkaProps.put("acks", "1")
  kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put("ssl.truststore.location", sslCertificatePath)
  kafkaProps.put("ssl.truststore.password", sslCertificatePassword)

  val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)

  def sendKafkaMessage(message: Message): Unit = {
    message.data.foreach(list => {
      val producerRecord: ProducerRecord[Long, Array[String]] = new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
      kafkaProducer.send(producerRecord)
    })
  }
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
  val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
    val parser: Parser = new Parser
    val kafkaProducer: KafkaProducer = new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)

    val newPartition = partition.map(message => {
      Logger.getLogger("importer").error("Writing Message to Kafka...")
      kafkaProducer.sendKafkaMessage(message)
      Logger.getLogger("importer").error("Finished writing Message to Kafka")
      message.data.map(singleMessage => parser.parseMessage(message.timeStamp.getTime, singleMessage))
    })
    newPartition.flatten
  })

  val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
  Logger.getLogger("importer").info("Entries-count: " + df.count())
  val row = Try(df.first)

  row match {
    case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
    case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
  }
})
From the logs I can tell that each executor is reaching the "sending to kafka" point, however not any further. All executors hang on that and no exception is thrown.
The Message class is a very simple case class with 2 fields, a timestamp and an array of strings.
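For context, a minimal sketch of what that case class presumably looks like (field names and collection types are inferred from the calls above, so treat them as assumptions):

// timeStamp.getTime is called, so a java.util.Date (or java.sql.Timestamp) fits;
// data is iterated as a collection whose elements are converted with .toArray
case class Message(timeStamp: java.util.Date, data: Seq[Seq[String]])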
This was due to the acks field in Kafka.
Acks was set to 1 and sends went ahead a lot faster.

running akka stream in parallel

I have a stream that:
- listens for HTTP POSTs receiving a list of events
- mapConcats the list of events into stream elements
- converts each event into a Kafka record
- produces the record with reactive kafka (akka stream kafka producer sink)
Here is the simplified code:
// flow to split a group of lines into lines
val splitLines = Flow[List[Evt]].mapConcat(list => list)

// sink to produce kafka records in kafka
val kafkaSink: Sink[Evt, Future[Done]] = Flow[Evt]
  .map(evt => new ProducerRecord[Array[Byte], String](evt.eventType, evt.value))
  .toMat(Producer.plainSink(kafka))(Keep.right)

val routes = {
  path("ingest") {
    post {
      (entity(as[List[ReactiveEvent]]) & extractMaterializer) { (eventIngestList, mat) =>
        val ingest = Source.single(eventIngestList).via(splitLines).runWith(kafkaSink)(mat)
        val result = onComplete(ingest) {
          case Success(value) => complete(s"OK")
          case Failure(ex) => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
        }
        complete("eventList ingested: " + result)
      }
    }
  }
}
Could you point out to me what is run in parallel and what is sequential?
I think the mapConcat sequentializes the events in the stream, so how could I parallelize the stream so that after the mapConcat each step is processed in parallel?
Would a simple mapAsyncUnordered be sufficient? Or should I use the GraphDSL with a Balance and Merge?
In your case it will be sequential, I think. Also, you're getting the whole request before you start pushing data to Kafka. I'd use the extractDataBytes directive, which gives you src: Source[ByteString, Any]. Then I'd process it like:
src
  .via(Framing.delimiter(ByteString("\n"), 1024 /* max size of a line */, allowTruncation = true).map(_.utf8String))
  .mapConcat { line =>
    line.split(",").toList
  }.async
  .runWith(kafkaSink)(mat)
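Wired into the route, that could look roughly like this (a sketch; it produces ProducerRecords directly via Producer.plainSink rather than the Evt-typed kafkaSink from the question, and the "events" topic name is a placeholder):

path("ingest") {
  post {
    (extractDataBytes & extractMaterializer) { (src, mat) =>
      val done = src
        .via(Framing.delimiter(ByteString("\n"), 1024, allowTruncation = true))
        .map(_.utf8String)
        .mapConcat(_.split(",").toList)
        .map(value => new ProducerRecord[Array[Byte], String]("events", value)) // placeholder topic
        .runWith(Producer.plainSink(kafka))(mat)

      onComplete(done) {
        case Success(_)  => complete("OK")
        case Failure(ex) => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
      }
    }
  }
}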