param message.send.max.retries does not work in kafka producer - scala

I have a Scala project built with sbt, and it contains a producer.
I am trying to retry sending messages when Kafka is unreachable.
package com.example

import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
import org.apache.kafka.clients.producer.{ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.control.NonFatal

object producer extends App {

  private val decider: Supervision.Decider = {
    case NonFatal(ex) =>
      println("Non fatal exception in flow. Skip message and resuming flow.", ex)
      Supervision.Restart
    case ex: Throwable =>
      println("Other exception in flow. Stopping flow.", ex)
      Supervision.Stop
  }

  implicit val system = ActorSystem("QuickStart")
  private val strategy =
    ActorMaterializerSettings(system).withSupervisionStrategy(decider)
  implicit val materializer = ActorMaterializer(strategy)

  val config = system.settings.config.getConfig("akka.kafka.producer")
  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("10.20.10.193:9092")
      .withProperty("message.send.max.retries", "3")
      .withProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000")
      //.withProperty(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, "5000")

  val done =
    Source
      .single("11")
      .map(value => new ProducerRecord[String, String]("example", value))
      .runWith(Producer.plainSink(producerSettings))

  Await.result(done, 1000.seconds)
}
I defined the property:
.withProperty("message.send.max.retries", "3")
but it does not work.
When I run the producer against a bad Kafka host, the output is:
[INFO ] - 2018-07-30 23:00:36,951 - suppression - akka.event.slf4j.Slf4jLogger - Slf4jLogger started
(Non fatal exception in flow. Skip message and resuming flow.,org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 5000 ms.)
(Non fatal exception in flow. Skip message and resuming flow.,org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 5000 ms.)
Only two retries appear in the log instead of three.
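For reference, message.send.max.retries was a configuration key of the old Scala producer; the new Java producer that akka-stream-kafka wraps does not recognise it (it typically just logs a warning about an unknown configuration). The retry count of the new producer is set with retries (ProducerConfig.RETRIES_CONFIG). A minimal, hedged sketch of the settings with that key, reusing the values from the question; the backoff value is an illustrative assumption, not a recommendation:

// Hedged sketch: configure retries with the key the new Java producer actually reads.
val retryingProducerSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("10.20.10.193:9092")
    .withProperty(ProducerConfig.RETRIES_CONFIG, "3")
    // illustrative backoff between retry attempts (assumption)
    .withProperty(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "1000")
    .withProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "5000")

Note that, as far as I know, retries only applies to failed send attempts once the producer has metadata for the topic; the "Failed to update metadata" TimeoutException shown above is governed by max.block.ms, so a retry setting alone may not change that particular log output.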

Related

Scala Package Issue With ZKStringSerializer

I am trying to use the class ZKStringSerializer, which I get with
import kafka.utils.ZKStringSerializer
According to the entirety of the internet, and even my own code before I restarted my computer, this should allow my code to work. However, I now get an incredibly confusing compile error:
object ZKStringSerializer in package utils cannot be accessed in package kafka.utils
This is confusing because this file is not supposed to be in any package, and I don't specify a package anywhere. This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.ZkConnection
import java.util.Properties
import org.apache.kafka.clients.admin
import kafka.admin.{AdminUtils, RackAwareMode}
import kafka.utils.ZKStringSerializer
import kafka.utils.ZkUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SpeedTester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
    import spark.implicits._

    val zookeeperConnect = "localhost:2181"
    val sessionTimeoutMs = 10000
    val connectionTimeoutMs = 10000
    val zkClient = new ZkClient(zookeeperConnect, sessionTimeoutMs, connectionTimeoutMs, ZKStringSerializer)

    val topicName = "testTopic"
    val numPartitions = 8
    val replicationFactor = 1
    val topicConfig = new Properties
    val isSecureKafkaCluster = false
    val zkUtils = new ZkUtils(zkClient, new ZkConnection(zookeeperConnect), isSecureKafkaCluster)
    AdminUtils.createTopic(zkUtils, topicName, numPartitions, replicationFactor, topicConfig)

    // Create producer for topic testTopic and actually push values to the topic
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9592")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "testTopic"
    for (i <- 1 to 50) {
      val record = new ProducerRecord(TOPIC, "key", s"hello $i")
      producer.send(record)
    }
    val record = new ProducerRecord(TOPIC, "key", "the end" + new java.util.Date)
    producer.send(record)
    producer.flush()
    producer.close()
  }
}
I know this is too late, but for others who will be looking for the same issue:
In the latest versions of Kafka, kafka.utils has become internal and its use is deprecated, so please use the Kafka AdminClient APIs instead.
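A minimal sketch of creating the topic through the public AdminClient API instead of the internal kafka.utils / AdminUtils classes; the broker address, partition count and replication factor below simply mirror the question and are assumptions:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Hedged sketch: topic creation with the public AdminClient API.
val adminProps = new Properties()
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address

val adminClient = AdminClient.create(adminProps)
try {
  val newTopic = new NewTopic("testTopic", 8, 1.toShort) // name, partitions, replication factor
  adminClient.createTopics(Collections.singleton(newTopic)).all().get() // block until the topic exists
} finally {
  adminClient.close()
}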

Integration test Flink and Kafka with scalatest-embedded-kafka

I would like to run an integration test with Flink and Kafka. The process is to read from Kafka, do some manipulation with Flink, and put the resulting data stream back into Kafka.
I would like to test the process from beginning to end. For now I use scalatest-embedded-kafka.
I put an example here; I tried to keep it as simple as possible:
import java.util.Properties

import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.collection.mutable.ListBuffer

object SimpleFlinkKafkaTest {

  class CollectSink extends SinkFunction[String] {
    override def invoke(string: String): Unit = {
      synchronized {
        CollectSink.values += string
      }
    }
  }

  object CollectSink {
    val values: ListBuffer[String] = ListBuffer.empty[String]
  }

  val kafkaPort = 9092
  val zooKeeperPort = 2181

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:" + kafkaPort.toString)
  props.put("schema.registry.url", "localhost:" + zooKeeperPort.toString)

  val inputString = "mystring"
  val expectedString = "MYSTRING"
}

class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka" should {
    "work" in {
      implicit val config = EmbeddedKafkaConfig(
        kafkaPort = SimpleFlinkKafkaTest.kafkaPort,
        zooKeeperPort = SimpleFlinkKafkaTest.zooKeeperPort
      )

      withRunningKafka {
        publishStringMessageToKafka("input-topic", SimpleFlinkKafkaTest.inputString)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)

        val kafkaConsumer = new FlinkKafkaConsumer011(
          "input-topic",
          new SimpleStringSchema,
          SimpleFlinkKafkaTest.props
        )

        implicit val typeInfo = TypeInformation.of(classOf[String])

        val inputStream = env.addSource(kafkaConsumer)
        val outputStream = inputStream.map(_.toUpperCase)

        val kafkaProducer = new FlinkKafkaProducer011(
          "output-topic",
          new SimpleStringSchema(),
          SimpleFlinkKafkaTest.props
        )
        outputStream.addSink(kafkaProducer)

        env.execute()

        consumeFirstStringMessageFrom("output-topic") shouldEqual SimpleFlinkKafkaTest.expectedString
      }
    }
  }
}
I had an error, so I added the line implicit val typeInfo = TypeInformation.of(classOf[String]), but I don't really understand why I have to do that.
For now this code doesn't work: it runs without interruption, but it does not stop and does not give any result.
Does anyone have an idea? Or even a better idea for testing this kind of pipeline?
Thanks!
EDIT: added env.execute() and changed the error.
Here's a simple solution I came up with. The idea is to:
1. Start the embedded Kafka server
2. Create your test topics (here input and output)
3. Launch the Flink job in a Future to avoid blocking the main thread
4. Publish a message to the input topic
5. Check the result on the output topic
And the working prototype:
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.scalatest.{Matchers, WordSpec}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

class SimpleFlinkKafkaTest extends WordSpec with Matchers with EmbeddedKafka {

  "runs with embedded kafka on arbitrary available ports" should {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    "work" in {
      val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 9092, zooKeeperPort = 2182)

      val properties = new Properties()
      properties.setProperty("bootstrap.servers", "localhost:9092")
      properties.setProperty("zookeeper.connect", "localhost:2182")
      properties.setProperty("group.id", "test")
      properties.setProperty("auto.offset.reset", "earliest")

      val kafkaConsumer = new FlinkKafkaConsumer011[String]("input", new SimpleStringSchema(), properties)
      val kafkaSink = new FlinkKafkaProducer011[String]("output", new SimpleStringSchema(), properties)

      val stream = env
        .addSource(kafkaConsumer)
        .map(_.toUpperCase)
        .addSink(kafkaSink)

      withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
        createCustomTopic("input")
        createCustomTopic("output")
        Future { env.execute() }
        publishStringMessageToKafka("input", "Titi")
        consumeFirstStringMessageFrom("output") shouldEqual "TITI"
      }
    }
  }
}
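One detail worth noting about the prototype above: it uses withRunningKafkaOnFoundPort but still hardcodes ports 9092/2182 in the Flink properties, which only works while those exact ports are free. A hedged variation that derives the addresses from the configuration actually bound (assuming, as in scalatest-embedded-kafka, that the config passed to the block exposes kafkaPort and zooKeeperPort) would build the properties inside the block:

withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
  // Hedged sketch: use the ports that were actually bound instead of hardcoded ones.
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", s"localhost:${actualConfig.kafkaPort}")
  properties.setProperty("zookeeper.connect", s"localhost:${actualConfig.zooKeeperPort}")
  properties.setProperty("group.id", "test")
  properties.setProperty("auto.offset.reset", "earliest")
  // ... then define the consumer, sink and stream here and run the job as above
}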

How to make it pure?

I have the following Scala code:
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.ConsumerMessage.CommittableOffsetBatch
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.Future

object TestConsumer {
  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("KafkaConsumer")
    implicit val materializer = ActorMaterializer()

    val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("group1")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val result = Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .mapAsync(2)(rec => Future.successful(rec.record.value()))
      .runWith(Sink.foreach(ele => {
        print(ele)
        system.terminate()
      }))
  }
}
As you can see, the application consumes messages from Kafka and prints them on the shell.
runWith is not pure: it produces side effects, printing the received message and shutting down the actor system.
The question is: how can I make it pure with cats IO effects? Is that possible?
You don't need cats IO to make it pure. Note that your sink is already pure, because it's just the value that describes what will happen when it's used (in this case using means "connecting to the Source and running the stream").
val sink: Sink[String, Future[Done]] = Sink.foreach(ele => {
  print(ele)
  // system.terminate() // PROBLEM: terminating the system before stream completes!
})
The problem you described has nothing to do with purity. The problem is that the sink above closes over the value of system, and then tries to terminate it when processing each element of the source.
Terminating the system means that you are destroying the whole runtime environment (used by ActorMaterializer) that is used to run the stream. This should only be done when your stream completes.
val result: Future[Done] = Consumer
  .committableSource(consumerSettings, Subscriptions.topics("test"))
  .mapAsync(2)(rec => Future.successful(rec.record.value()))
  .runWith(sink)

// onComplete needs an implicit ExecutionContext in scope, e.g. import system.dispatcher
result.onComplete(_ => system.terminate())
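That said, if the goal is specifically to push the remaining side effects to the edge with cats-effect, one hedged option (sketched here against cats-effect 2.x, where IO.fromFuture needs an implicit ContextShift) is to suspend the materialization in IO so that nothing runs until the program is executed:

import cats.effect.{ContextShift, IO}

// Hedged sketch: describe the whole pipeline as an IO value; side effects
// happen only when the IO is run at the "end of the world".
implicit val cs: ContextShift[IO] = IO.contextShift(system.dispatcher)

val program: IO[Done] = for {
  done <- IO.fromFuture(IO {
            Consumer
              .committableSource(consumerSettings, Subscriptions.topics("test"))
              .mapAsync(2)(rec => Future.successful(rec.record.value()))
              .runWith(Sink.foreach(ele => print(ele)))
          })
  _ <- IO.fromFuture(IO(system.terminate()))
} yield done

program.unsafeRunSync() // the only place where anything actually happens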

Reactive-Kafka Stream Consumer: Dead letters occurred

I am trying to consume messages from Kafka using Akka's Reactive Kafka library. I get one message printed, and after that I get
[INFO] [01/24/2017 10:36:52.934] [CommittableSourceConsumerMain-akka.actor.default-dispatcher-5] [akka://CommittableSourceConsumerMain/system/kafka-consumer-1] Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] from Actor[akka://CommittableSourceConsumerMain/deadLetters] to Actor[akka://CommittableSourceConsumerMain/system/kafka-consumer-1#-1726905274] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
This is the code I am executing
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import play.api.libs.json._
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}

object CommittableSourceConsumerMain extends App {
  implicit val system = ActorSystem("CommittableSourceConsumerMain")
  implicit val materializer = ActorMaterializer()

  val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("CommittableSourceConsumer")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val done =
    Consumer.committableSource(consumerSettings, Subscriptions.topics("topic1"))
      .mapAsync(1) { msg =>
        val record = msg.record.value()
        val data = Json.parse(record)
        val recordType = data \ "data" \ "event" \ "type"
        val actualData = data \ "data" \ "row"
        if (recordType.as[String] == "created") {
          "Some saving logic"
        } else {
          "Some logic"
        }
        msg.committableOffset.commitScaladsl()
      }
      .runWith(Sink.ignore)
}
I finally figured out the solution. Due to a runtime exception in the stream, a failed Future is returned, which terminates the stream immediately.
Akka Streams does not surface or log that runtime exception by default. To find out what the exception is:
done.onFailure {
  case NonFatal(e) => println(e)
}
The exception was in the if-else block.
One can also use a supervision strategy to resume the stream if an exception occurs.
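A hedged sketch of what such a supervision strategy could look like with Akka Streams attributes; the decider below simply resumes on any non-fatal exception (dropping the offending element), which may or may not be what you want:

import akka.stream.{ActorAttributes, Supervision}
import scala.util.control.NonFatal

// Hedged sketch: resume the stream on non-fatal exceptions instead of failing it.
val decider: Supervision.Decider = {
  case NonFatal(e) =>
    println(s"Dropping element due to: $e")
    Supervision.Resume
  case _ =>
    Supervision.Stop
}

val done =
  Consumer.committableSource(consumerSettings, Subscriptions.topics("topic1"))
    .mapAsync(1) { msg =>
      // ... parsing and saving logic as above ...
      msg.committableOffset.commitScaladsl()
    }
    .withAttributes(ActorAttributes.supervisionStrategy(decider))
    .runWith(Sink.ignore)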

Unable to serialize SparkContext in foreachRDD

I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save the data I get a Task not serializable exception. My class extends Serializable, but I am not sure why I am seeing this error; I didn't get much help even after googling for 3 hours. Can somebody give me any pointers?
val collection = sc.parallelize(Seq((obj.id, obj.data)))
collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector.SomeColumns
import java.util.Formatter.DateTime

object StreamProcessor extends Serializable {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = args.toSet

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        try {
          rdd.foreachPartition { iter =>
            iter.foreach {
              case (key, msg) =>
                val obj = msgParseMaster(msg)
                val collection = sc.parallelize(Seq((obj.id, obj.data)))
                collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
            }
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }

  import org.json4s._
  import org.json4s.native.JsonMethods._

  case class wordCount(id: Long, data: String) extends serializable

  implicit val formats = DefaultFormats

  def msgParseMaster(msg: String): wordCount = {
    val m = parse(msg).extract[wordCount]
    return m
  }
}
I am getting
org.apache.spark.SparkException: Task not serializable
below is the full log
16/08/06 10:24:52 ERROR JobScheduler: Error running job streaming job 1470504292000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at
SparkContext isn't serializable: you can't use it inside foreachRDD, and given the shape of your graph you don't need it there. Instead, you can simply map over each RDD, parse out the relevant data, and save that new RDD to Cassandra:
stream
  .map {
    case (_, msg) =>
      val result = msgParseMaster(msg)
      (result.id, result.data)
  }
  .foreachRDD(rdd =>
    if (!rdd.isEmpty)
      rdd.saveToCassandra("testKS", "testTable", SomeColumns("id", "data")))
You cannot call sc.parallelize within a function passed to foreachPartition: that function would have to be serialized and sent to each executor, and SparkContext is (intentionally) not serializable (it should only reside within the driver application, not on the executors).
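If per-element logic genuinely has to run inside foreachPartition, a hedged alternative is the connector's CassandraConnector, which (unlike SparkContext) is serializable and can open sessions on the executors. The table schema assumed in the INSERT below is an assumption based on the columns used in the question:

import com.datastax.spark.connector.cql.CassandraConnector

// Hedged sketch: CassandraConnector can be captured by the closure and used on
// the executors, unlike SparkContext.
val connector = CassandraConnector(sparkConf)

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    connector.withSessionDo { session =>
      iter.foreach {
        case (_, msg) =>
          val obj = msgParseMaster(msg)
          // assumes a table testKS.testTable(id bigint, data text)
          session.execute(
            "INSERT INTO testKS.testTable (id, data) VALUES (?, ?)",
            java.lang.Long.valueOf(obj.id), obj.data)
      }
    }
  }
}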