Troubleshooting high time taken to publish event to Kafka - apache-kafka

Publishing to Kafka takes 3 seconds in one of our environments, whereas in the other environments it takes only about 20 milliseconds.
We have done the following to troubleshoot this issue:
We ran a trace route and do not see any network latency.
We analyzed the logs and found the following:
2023-01-05T07:26:24.627Z DEBUG 40 --- [ad | producer-5] o.a.k.c.NetworkClient : [Producer clientId=producer-5] Using older server API v7 to send PRODUCE {acks=-1,timeout=30000,partitionSizes=[esd.mob.vocc.datacollector.sampio.plab01.raw-9=464]} with correlation id 1156 to node 46
2023-01-05T07:26:24.627Z TRACE 40 --- [ad | producer-5] o.a.k.c.p.i.Sender : [Producer clientId=producer-5] Sent produce request to 46: (type=ProduceRequest, acks=-1, timeout=30000, partitionRecords=({esd.mob.vocc.datacollector.sampio.plab01.raw-9=MemoryRecords(size=464, buffer=java.nio.HeapByteBuffer[pos=0 lim=464 cap=464])}), transactionalId=''
2023-01-05T07:26:27.700Z TRACE 40 --- [ad | producer-5] o.a.k.c.NetworkClient : [Producer clientId=producer-5] Completed receive from node 46 for PRODUCE with correlation id 1156, received {responses=[{topic=esd.mob.vocc.datacollector.sampio.plab01.raw,partition_responses=[{partition=9,error_code=0,base_offset=31376,log_append_time=-1,log_start_offset=31001}]}],throttle_time_ms=0}
2023-01-05T07:26:27.700Z TRACE 40 --- [ad | producer-5] o.a.k.c.p.i.Sender : [Producer clientId=producer-5] Received produce response from node 46 with correlation id 1156
The log clearly shows that about 3 seconds pass between sending the produce request and receiving the response.
We are using the code below:
CompletableFuture<SendResult<String, String>> sendResultCompletableFuture = kafkaTemplate.send(payloadMsg);
sendResultCompletableFuture.whenComplete((data, ex) -> {
    if (Objects.nonNull(ex)) {
        LOGGER.error("Exception while publishing message to kafka, {}", ExceptionUtils.getStackTrace(ex));
        throw new KafkaConnectorException(ExceptionUtils.getStackTrace(ex), topic, message);
    } else {
        LOGGER.info("Message handle published to kafka successfully for key ::");
    }
});
Please help us troubleshoot this issue further and share your thoughts.
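One observation from the TRACE lines above: the ~3 seconds elapse between "Sent produce request" (07:26:24.627) and "Completed receive" (07:26:27.700) for correlation id 1156, so the wait appears to be on the broker's acks=-1 acknowledgement rather than in the client. Below is a minimal sketch for cross-checking that against the producer's own metrics; it assumes the same kafkaTemplate and LOGGER as in the snippet above, and uses the standard producer-metrics metric names.

// request-latency-avg/max   -> time spent waiting for the broker's PRODUCE response
// record-queue-time-avg/max -> time batches sat in the client-side accumulator
kafkaTemplate.metrics().forEach((name, metric) -> {
    String metricName = name.name();
    if (metricName.startsWith("request-latency") || metricName.startsWith("record-queue-time")) {
        LOGGER.info("{} ({}) = {}", metricName, name.group(), metric.metricValue());
    }
});

If request-latency-avg accounts for most of the 3 seconds while record-queue-time-avg stays small, the delay is on the broker/replication side (for example, a slow follower holding up the acks=-1 acknowledgement) rather than in the application or the network path.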

Related

Apache flume with kafka source, kafka sink and memory channel - throwing UNKNOWN_TOPIC_OR_PARTITION

I am new to Apache Flume (https://flume.apache.org/). For one of my use-cases, I need to move data from a Kafka topic on one cluster (bootstrap: bootstrap1, topic: topic1) to a topic with a different name in a different cluster (bootstrap: bootstrap2, topic: topic2). There are other use-cases in the same project for which Flume is the best fit, and I need to use the same Flume pipeline for this use-case, even though there are other options for copying from Kafka to Kafka.
I tried the configs below; the result for each option is noted with it.
#1: telnet to Kafka sink (bootstrap2, topic2) --> works perfectly.
configs:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic2
a1.sinks.k1.kafka.bootstrap.servers = bootstrap2
a1.sinks.k1.kafka.flumeBatchSize = 100
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#2: Kafka as source (bootstrap1, topic1) and logger as sink --> works perfectly.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 10
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bootstrap1
a1.sources.r1.kafka.topics = topic1
a1.sources.r1.kafka.consumer.group.id = flume-gis-consumer
a1.sources.r1.backoffSleepIncrement = 1000
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#3: Kafka as source (bootstrap1, topic1) and Kafka as sink (bootstrap2, topic2) --> gives the error shown below the config.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 10
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bootstrap1
a1.sources.r1.kafka.topics = topic1
a1.sources.r1.kafka.consumer.group.id = flume-gis-consumer1
a1.sources.r1.backoffSleepIncrement = 1000
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic2
a1.sinks.k1.kafka.bootstrap.servers = bootstrap2
a1.sinks.k1.kafka.flumeBatchSize = 100
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Error:
(kafka-producer-network-thread | producer-1) [WARN - org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:968)] [Producer clientId=producer-1] Error while fetching metadata with correlation id 85 : {topic1=UNKNOWN_TOPIC_OR_PARTITION}
The above error is logged continuously.
The following error appears upon terminating the flume-ng command:
(SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:268)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.EventDeliveryException: Could not send event
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:234)
... 3 more
Seeking help from the Stack Overflow community on:
Which config is wrong here? The Kafka topics exist in their respective clusters. (Option 1 and Option 2 work fine, and I can see messages flowing from source to sink.)
Why is the producer thread trying to produce into the source Kafka topic?
I encountered the same issue today. My case is even worse because I host two topics on a single Kafka cluster.
It is really misleading that the producer thread in Kafka sink is producing back to the Kafka source topic.
I fixed the issue by setting allowTopicOverride to false for Kafka sink.
Quote from the Kafka sink section of the Flume documentation:
allowTopicOverride: Default is true. When set, the sink will allow a message to be produced into a topic specified by the topicHeader property (if provided).
topicHeader: When set in conjunction with allowTopicOverride will produce a message into the value of the header named using the value of this property. Care should be taken when using in conjunction with the Kafka Source topicHeader property to avoid creating a loopback.
And from the Kafka source section:
setTopicHeader: Default is true. When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property.
So, by default, Apache Flume stores the Kafka source topic in the topicHeader header of each event, and the Kafka sink by default writes to the topic specified in topicHeader, which creates the loopback.
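In config #3 above, the corresponding fix is a single extra line on the sink (or, alternatively, disabling the header on the source side). A sketch, assuming the Flume 1.x Kafka sink/source property names quoted above:

a1.sinks.k1.allowTopicOverride = false
# or, on the source side, stop writing the topic header in the first place:
# a1.sources.r1.setTopicHeader = false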

Kafka consumer test and reported metrics

I want to understand how the kafka-consumer-perf-test works and how to interpret some of the numbers it reports.
Below is the test I ran and the output I got. My questions are:
The values reported for rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec are 1593109326098, -1593108732333, -0.0003, -0.2800. Can you explain how it can report such huge and negative numbers? They don't make sense to me.
Everything reported from the "Metric Name Value" line onward is printed because of the --print-metrics flag. What is the difference between the metrics reported by default and those printed with this flag? How are they calculated, and where can I read about what they mean?
No matter whether I scale the number of consumers running in parallel or scale the network and I/O threads at the broker, the consumer-fetch-manager-metrics:fetch-latency-avg metric remains almost the same. Can you explain this? With more consumers pulling data, fetch latency should go up; similarly, for a given consumption rate, if I reduce the I/O and network thread counts at the broker, shouldn't latency scale higher?
Here is the command I ran:
[root@oak-clx17 kafka_2.12-2.5.0]# bin/kafka-consumer-perf-test.sh --topic topic_test8_cons_test1 --threads 1 --broker-list clx20:9092 --messages 500000000 --consumer.config config/consumer.properties --print-metrics
and the results:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms,fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
WARNING: Exiting before consuming the expected number of messages: timeout (10000 ms) exceeded. You can use the --timeout option to increase the timeout.
2020-06-25 11:22:05:814, 2020-06-25 11:31:59:579, 435640.7686, 733.6922, 446096147, 751300.8463, 1593109326098, -1593108732333, -0.0003, -0.2800
 
Metric Name Value
consumer-coordinator-metrics:assigned-partitions:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:commit-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 2.700
consumer-coordinator-metrics:commit-latency-max:{client-id=consumer-perf-consumer-25533-1} : 4.000
consumer-coordinator-metrics:commit-rate:{client-id=consumer-perf-consumer-25533-1} : 0.230
consumer-coordinator-metrics:commit-total:{client-id=consumer-perf-consumer-25533-1} : 119.000
consumer-coordinator-metrics:failed-rebalance-rate-per-hour:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:failed-rebalance-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:heartbeat-rate:{client-id=consumer-perf-consumer-25533-1} : 0.337
consumer-coordinator-metrics:heartbeat-response-time-max:{client-id=consumer-perf-consumer-25533-1} : 6.000
consumer-coordinator-metrics:heartbeat-total:{client-id=consumer-perf-consumer-25533-1} : 197.000
consumer-coordinator-metrics:join-rate:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:join-time-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:join-time-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:join-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:last-heartbeat-seconds-ago:{client-id=consumer-perf-consumer-25533-1} : 2.000
consumer-coordinator-metrics:last-rebalance-seconds-ago:{client-id=consumer-perf-consumer-25533-1} : 593.000
consumer-coordinator-metrics:partition-assigned-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-assigned-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-lost-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-lost-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:partition-revoked-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:partition-revoked-latency-max:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:rebalance-latency-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:rebalance-latency-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:rebalance-latency-total:{client-id=consumer-perf-consumer-25533-1} : 83.000
consumer-coordinator-metrics:rebalance-rate-per-hour:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:rebalance-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-coordinator-metrics:sync-rate:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-coordinator-metrics:sync-time-avg:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:sync-time-max:{client-id=consumer-perf-consumer-25533-1} : NaN
consumer-coordinator-metrics:sync-total:{client-id=consumer-perf-consumer-25533-1} : 1.000
consumer-fetch-manager-metrics:bytes-consumed-rate:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 434828205.989
consumer-fetch-manager-metrics:bytes-consumed-rate:{client-id=consumer-perf-consumer-25533-1} : 434828205.989
consumer-fetch-manager-metrics:bytes-consumed-total:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 460817319851.000
consumer-fetch-manager-metrics:bytes-consumed-total:{client-id=consumer-perf-consumer-25533-1} : 460817319851.000
consumer-fetch-manager-metrics:fetch-latency-avg:{client-id=consumer-perf-consumer-25533-1} : 58.870
consumer-fetch-manager-metrics:fetch-latency-max:{client-id=consumer-perf-consumer-25533-1} : 503.000
consumer-fetch-manager-metrics:fetch-rate:{client-id=consumer-perf-consumer-25533-1} : 48.670
consumer-fetch-manager-metrics:fetch-size-avg:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 9543108.526
consumer-fetch-manager-metrics:fetch-size-avg:{client-id=consumer-perf-consumer-25533-1} : 9543108.526
consumer-fetch-manager-metrics:fetch-size-max:{client-id=consumer-perf-consumer-25533-1, topic=topic_test8_cons_test1} : 11412584.000
consumer-fetch-manager-metrics:fetch-size-max:{client-id=consumer-perf-consumer-25533-1} : 11412584.000
consumer-fetch-manager-metrics:fetch-throttle-time-avg:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-fetch-manager-metrics:fetch-throttle-time-max:{client-id=consumer-perf-consumer-25533-1} : 0.000
consumer-fetch-manager-metrics:fetch-total:{client-id=consumer-perf-consumer-25533-1} : 44889.000
Exception in thread "main" java.util.IllegalFormatConversionException: f != java.lang.Integer
at java.base/java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4426)
at java.base/java.util.Formatter$FormatSpecifier.printFloat(Formatter.java:2951)
at java.base/java.util.Formatter$FormatSpecifier.print(Formatter.java:2898)
at java.base/java.util.Formatter.format(Formatter.java:2673)
at java.base/java.util.Formatter.format(Formatter.java:2609)
at java.base/java.lang.String.format(String.java:2897)
at scala.collection.immutable.StringLike.format(StringLike.scala:354)
at scala.collection.immutable.StringLike.format$(StringLike.scala:353)
at scala.collection.immutable.StringOps.format(StringOps.scala:33)
at kafka.utils.ToolsUtils$.$anonfun$printMetrics$3(ToolsUtils.scala:60)
at kafka.utils.ToolsUtils$.$anonfun$printMetrics$3$adapted(ToolsUtils.scala:58)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.utils.ToolsUtils$.printMetrics(ToolsUtils.scala:58)
at kafka.tools.ConsumerPerformance$.main(ConsumerPerformance.scala:82)
at kafka.tools.ConsumerPerformance.main(ConsumerPerformance.scala)
Here is a post I've written on this topic: https://medium.com/metrosystemsro/apache-kafka-how-to-test-performance-for-clients-configured-with-ssl-encryption-3356d3a0d52b
The expectation according to this KIP is for rebalance.time.ms and fetch.time.ms to show the total rebalance time for the consumer group and the total fetch time excluding the rebalance time.
As far as I can tell, as of Apache Kafka version 2.6.0 this is still a work in progress, and currently the output values are timestamps in Unix epoch time.
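A quick way to sanity-check that claim (a throwaway snippet, not part of the perf-test tooling):

import java.time.Instant;

public class EpochCheck {
    public static void main(String[] args) {
        // rebalance.time.ms as reported by the test run above
        long reported = 1593109326098L;
        // Prints an instant on 2020-06-25, the day the test above was run, which is
        // consistent with the value being a Unix epoch timestamp in milliseconds
        // rather than a duration.
        System.out.println(Instant.ofEpochMilli(reported));
    }
}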
fetch.MB.sec and fetch.nMsg.sec are intended to show the average quantity of messages consumed per second (in MB and as a message count).
See https://kafka.apache.org/documentation/#consumer_group_monitoring for the consumer group metrics listed with the --print-metrics flag.
fetch-latency-avg (the average time taken for a fetch request) will vary, but this depends a lot on the test setup.

org.apache.kafka.common.errors.RecordTooLargeException in Flume Kafka Sink

I am trying to read data from a JMS source and push it into a Kafka topic. After a few hours of doing that, I observed that the push frequency to the Kafka topic dropped to almost zero, and after some initial analysis I found the following exception in the Flume logs:
28 Feb 2017 16:35:44,758 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:158) - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:252)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1399305 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:686)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:449)
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:212)
... 3 more
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1399305 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
My Flume log shows the currently set value for max.request.size as 1048576, which is clearly much less than 1399305. Increasing max.request.size should eliminate these exceptions, but I am unable to find the correct place to update that value.
My flume.config:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = file
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.capacity = 100000000
a1.channels.c1.checkpointDir = /data/flume/apache-flume-1.7.0-bin/checkpoint
a1.channels.c1.dataDirs = /data/flume/apache-flume-1.7.0-bin/data
a1.sources.r1.type = jms
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = some context urls
a1.sources.r1.connectionFactory = some_queue
a1.sources.r1.providerURL = some_url
#a1.sources.r1.providerURL = some_url
a1.sources.r1.destinationType = QUEUE
a1.sources.r1.destinationName = some_queue_name
a1.sources.r1.userName = some_user
a1.sources.r1.passwordFile= passwd
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = some_kafka_topic
a1.sinks.k1.kafka.bootstrap.servers = some_URL
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.flumeBatchSize = 1
a1.sinks.k1.channel = c1
Any help will be really appreciated !!
This change has to be done on the Kafka producer side.
Update the Kafka producer configuration file producer.properties with a larger value, for example:
max.request.size=10000000
It seems I have resolved my issue. As suspected, increasing max.request.size eliminated the exception. For updating such Kafka sink (producer) properties, Flume provides the constant prefix kafka.producer., and we can append any Kafka producer property to this prefix.
So mine goes as: a1.sinks.k1.kafka.producer.max.request.size = 5271988
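In context, the sink block from the flume.config above would then look like the following (a sketch; 5271988 is just the asker's chosen value, and depending on broker defaults the broker-side message.max.bytes and topic-level max.message.bytes may also need to allow messages of that size):

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = some_kafka_topic
a1.sinks.k1.kafka.bootstrap.servers = some_URL
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.max.request.size = 5271988
a1.sinks.k1.flumeBatchSize = 1
a1.sinks.k1.channel = c1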

Spark Job keeps failing by the running the same map 3 times

My job has a step where I convert the data frame to an RDD[(key, value)], but the step runs 3 times, gets stuck on the third attempt, and fails.
Spark UI shows :
Active Jobs(1)
Job Id (Job Group) Description Submitted Duration Stages: Succeeded/Total Tasks (for all stages): Succeeded/Total
3 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:50:02 1.6 min 0/1 210/45623
Completed Jobs (2)
2 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 05:16:28 23 min 1/1 46742/46075 (21 failed)
1 (zeppelin-20161017-005442_839671900) Zeppelin map at <console>:69 2016/10/25 04:47:58 17 min 1/1 47369/46795 (20 failed)
This is the code :
val eventsRDD = eventsDF.map { r =>
  val customerId = r.getAs[String]("customerId")
  val itemId = r.getAs[String]("itemId")
  val countryId = r.getAs[Long]("countryId").toInt
  val timeStamp = r.getAs[String]("eventTimestamp")
  val totalRent = r.getAs[Int]("totalRent")
  val totalPurchase = r.getAs[Int]("totalPurchase")
  val totalProfit = r.getAs[Int]("totalProfit")
  val store = r.getAs[String]("store")
  val rawItemName = r.getAs[String]("itemName")
  // check for null before touching the value; the duplicate `val itemName` in the
  // original would not compile, and the null check has to come before .nonEmpty
  val itemName = if (rawItemName != null && rawItemName.nonEmpty) rawItemName else "NA"
  (itemId, (customerId, countryId, timeStamp, totalRent, totalProfit, totalPurchase, store, itemName))
}
Can someone tell me what is wrong here? If I want to persist/cache, which one should I use?
Error :
16/10/25 23:28:55 INFO YarnClientSchedulerBackend: Asked to remove non-existent executor 181
16/10/25 23:28:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1477415847345_0005_02_031011 on host: ip-172-31-14-104.ec2.internal. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_1477415847345_0005_02_031011
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Your map operation is resulting in some error, and that propagates to the driver, which results in task failure.
By default spark.task.maxFailures has a value of 4, which is:
Number of failures of any particular task before giving up on the job.
The total number of failures spread across different tasks will not
cause the job to fail; a particular task has to fail this number of
attempts. Should be greater than or equal to 1. Number of allowed
retries = this value - 1.
So what happens is: when your task fails, Spark tries to recompute the map operation until it has failed 4 times in all.
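For reference, that threshold is configurable via a generic spark-submit flag (raising it only buys more retries; it does not fix the underlying error in the map):

spark-submit --conf spark.task.maxFailures=8 ...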
If I want to persist/cache, which one should I do?
cache is just a specific persist operation where the RDD is persisted with the default storage level (MEMORY_ONLY).

Kafka Producer 0.9 performance issue with small messages

We are observing very poor performance with a Java Kafka Producer 0.9 client when sending small messages. The messages are not being accumulated into a larger request batch and thus each small record is being sent separately.
What is wrong with our client configuration? Or is this some other issue?
Using Kafka Client 0.9.0.0. We did not see any related postings in the Kafka unreleased 9.0.1 or 9.1 fixed or unresolved lists, so we are focused on our client configuration and server instance.
We understand the linger.ms should cause the client to accumulate records into a batch.
We set linger.ms to 10 (and also tried 100 and 1000), but these did not result in the batch accumulating records. With a record size of about 100 bytes and a request buffer size of 16K, we would have expected about 160 messages to be sent in a single request.
The trace at the client seems to indicate that the partition may be full, despite having allocated a fresh Bluemix Messaging Hub (Kafka Server 0.9) service instance. The test client is sending multiple messages in a loop with no other I/O.
The log shows a repeating sequence with a suspect line: "Waking up the sender since topic mytopic partition 0 is either full or getting a new batch".
So the newly allocated partition should be essentially empty in our test case, thus why would the producer client be getting a new batch?
2015-12-10 15:14:41,335 3677 [main] TRACE com.isllc.client.producer.ExploreProducer - Sending record: Topic='mytopic', Key='records', Value='Kafka 0.9 Java Client Record Test Message 00011 2015-12-10T15:14:41.335-05:00'
2015-12-10 15:14:41,336 3678 [main] TRACE org.apache.kafka.clients.producer.KafkaProducer - Sending record ProducerRecord(topic=mytopic, partition=null, key=[B#670b40af, value=[B#4923ab24 with callback null to topic mytopic partition 0
2015-12-10 15:14:41,336 3678 [main] TRACE org.apache.kafka.clients.producer.internals.RecordAccumulator - Allocating a new 16384 byte message buffer for topic mytopic partition 0
2015-12-10 15:14:41,336 3678 [main] TRACE org.apache.kafka.clients.producer.KafkaProducer - Waking up the sender since topic mytopic partition 0 is either full or getting a new batch
2015-12-10 15:14:41,348 3690 [kafka-producer-network-thread | ExploreProducer] TRACE org.apache.kafka.clients.producer.internals.Sender - Nodes with data ready to send: [Node(0, kafka01-prod01.messagehub.services.us-south.bluemix.net, 9094)]
2015-12-10 15:14:41,348 3690 [kafka-producer-network-thread | ExploreProducer] TRACE org.apache.kafka.clients.producer.internals.Sender - Created 1 produce requests: [ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.producer.internals.Sender$1#6d62e963, request=RequestSend(header={api_key=0,api_version=1,correlation_id=11,client_id=ExploreProducer}, body={acks=-1,timeout=30000,topic_data=[{topic=mytopic,data=[{partition=0,record_set=java.nio.HeapByteBuffer[pos=0 lim=110 cap=16384]}]}]}), createdTimeMs=1449778481348, sendTimeMs=0)]
2015-12-10 15:14:41,412 3754 [kafka-producer-network-thread | ExploreProducer] TRACE org.apache.kafka.clients.producer.internals.Sender - Received produce response from node 0 with correlation id 11
2015-12-10 15:14:41,412 3754 [kafka-producer-network-thread | ExploreProducer] TRACE org.apache.kafka.clients.producer.internals.RecordBatch - Produced messages to topic-partition mytopic-0 with base offset offset 130 and error: null.
2015-12-10 15:14:41,412 3754 [main] TRACE com.isllc.client.producer.ExploreProducer - Send returned metadata: Topic='mytopic', Partition=0, Offset=130
2015-12-10 15:14:41,412 3754 [main] TRACE com.isllc.client.producer.ExploreProducer - Sending record: Topic='mytopic', Key='records', Value='Kafka 0.9 Java Client Record Test Message 00012 2015-12-10T15:14:41.412-05:00'
Log entries like the above repeat for each record sent.
We provided the following properties file:
2015-12-10 15:14:37,843 185 [main] INFO com.isllc.client.AbstractClient - Properties retrieved from file for Kafka client: kafka-producer.properties
2015-12-10 15:14:37,909 251 [main] INFO com.isllc.client.AbstractClient - acks=-1
2015-12-10 15:14:37,909 251 [main] INFO com.isllc.client.AbstractClient - ssl.protocol=TLSv1.2
2015-12-10 15:14:37,909 251 [main] INFO com.isllc.client.AbstractClient - key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - client.id=ExploreProducer
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - ssl.truststore.identification.algorithm=HTTPS
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - ssl.truststore.password=changeit
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - ssl.truststore.type=JKS
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - ssl.enabled.protocols=TLSv1.2
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - ssl.truststore.location=/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/security/cacerts
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - bootstrap.servers=kafka01-prod01.messagehub.services.us-south.bluemix.net:9094,kafka02-prod01.messagehub.services.us-south.bluemix.net:9094,kafka03-prod01.messagehub.services.us-south.bluemix.net:9094,kafka04-prod01.messagehub.services.us-south.bluemix.net:9094,kafka05-prod01.messagehub.services.us-south.bluemix.net:9094
2015-12-10 15:14:37,910 252 [main] INFO com.isllc.client.AbstractClient - security.protocol=SASL_SSL
Plus we added linger.ms=10 in code.
The Kafka Client shows the expanded/merged configuration list (and displaying the linger.ms setting):
2015-12-10 15:14:37,970 312 [main] INFO org.apache.kafka.clients.producer.ProducerConfig - ProducerConfig values:
compression.type = none
metric.reporters = []
metadata.max.age.ms = 300000
metadata.fetch.timeout.ms = 60000
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
bootstrap.servers = [kafka01-prod01.messagehub.services.us-south.bluemix.net:9094, kafka02-prod01.messagehub.services.us-south.bluemix.net:9094, kafka03-prod01.messagehub.services.us-south.bluemix.net:9094, kafka04-prod01.messagehub.services.us-south.bluemix.net:9094, kafka05-prod01.messagehub.services.us-south.bluemix.net:9094]
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
buffer.memory = 33554432
timeout.ms = 30000
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.keystore.type = JKS
ssl.trustmanager.algorithm = PKIX
block.on.buffer.full = false
ssl.key.password = null
max.block.ms = 60000
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
ssl.truststore.password = [hidden]
max.in.flight.requests.per.connection = 5
metrics.num.samples = 2
client.id = ExploreProducer
ssl.endpoint.identification.algorithm = null
ssl.protocol = TLSv1.2
request.timeout.ms = 30000
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2]
acks = -1
batch.size = 16384
ssl.keystore.location = null
receive.buffer.bytes = 32768
ssl.cipher.suites = null
ssl.truststore.type = JKS
security.protocol = SASL_SSL
retries = 0
max.request.size = 1048576
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location = /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre/lib/security/cacerts
ssl.keystore.password = null
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
send.buffer.bytes = 131072
linger.ms = 10
The Kafka metrics after sending 100 records:
Duration for 100 sends 8787 ms. Sent 7687 bytes.
batch-size-avg = 109.87 [The average number of bytes sent per partition per-request.]
batch-size-max = 110.0 [The max number of bytes sent per partition per-request.]
buffer-available-bytes = 3.3554432E7 [The total amount of buffer memory that is not being used (either unallocated or in the free list).]
buffer-exhausted-rate = 0.0 [The average per-second number of record sends that are dropped due to buffer exhaustion]
buffer-total-bytes = 3.3554432E7 [The maximum amount of buffer memory the client can use (whether or not it is currently used).]
bufferpool-wait-ratio = 0.0 [The fraction of time an appender waits for space allocation.]
byte-rate = 291.8348916277093 []
compression-rate = 0.0 []
compression-rate-avg = 0.0 [The average compression rate of record batches.]
connection-close-rate = 0.0 [Connections closed per second in the window.]
connection-count = 2.0 [The current number of active connections.]
connection-creation-rate = 0.05180541884681138 [New connections established per second in the window.]
incoming-byte-rate = 10.342564641029007 []
io-ratio = 0.0038877559207471236 [The fraction of time the I/O thread spent doing I/O]
io-time-ns-avg = 353749.2840375587 [The average length of time for I/O per select call in nanoseconds.]
io-wait-ratio = 0.21531227995769162 [The fraction of time the I/O thread spent waiting.]
io-wait-time-ns-avg = 1.9591901192488264E7 [The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.]
metadata-age = 8.096 [The age in seconds of the current producer metadata being used.]
network-io-rate = 5.2937784999213795 [The average number of network operations (reads or writes) on all connections per second.]
outgoing-byte-rate = 451.2298783403283 []
produce-throttle-time-avg = 0.0 [The average throttle time in ms]
produce-throttle-time-max = 0.0 [The maximum throttle time in ms]
record-error-rate = 0.0 [The average per-second number of record sends that resulted in errors]
record-queue-time-avg = 15.5 [The average time in ms record batches spent in the record accumulator.]
record-queue-time-max = 434.0 [The maximum time in ms record batches spent in the record accumulator.]
record-retry-rate = 0.0 []
record-send-rate = 2.65611304417116 [The average number of records sent per second.]
record-size-avg = 97.87 [The average record size]
record-size-max = 98.0 [The maximum record size]
records-per-request-avg = 1.0 [The average number of records per request.]
request-latency-avg = 0.0 [The average request latency in ms]
request-latency-max = 74.0 []
request-rate = 2.6468892499606897 [The average number of requests sent per second.]
request-size-avg = 42.0 [The average size of all requests in the window..]
request-size-max = 170.0 [The maximum size of any request sent in the window.]
requests-in-flight = 0.0 [The current number of in-flight requests awaiting a response.]
response-rate = 2.651196976060479 [The average number of responses received per second.]
select-rate = 10.989861465830819 [Number of times the I/O layer checked for new I/O to perform per second]
waiting-threads = 0.0 [The number of user threads blocked waiting for buffer memory to enqueue their records]
Thanks
Guozhang Wang on the Kafka Users mailing list was able to recognize the problem by reviewing our application code:
Guozhang,
Yes - you identified the problem!
We had inserted the .get() for debugging, but didn’t think of the
(huge!) side-effects.
Using the async callback works perfectly well.
We are now able to send 100,000 records in 14 sec from a laptop to the
Bluemix cloud - ~1000x faster,
Thank you very much!
Gary
On Dec 13, 2015, at 2:48 PM, Guozhang Wang wrote:
Gary,
You are calling "kafkaProducer.send(record).get();" for each message;
the get() call blocks until the Future is completed, which
effectively synchronizes all messages sent by waiting for the ACK for
each message before sending the next one, hence no batching.
You can try using "send(record, callback)" for async sending and let
the callback handle errors from the returned metadata.
Guozhang
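For reference, here is a minimal sketch of the change Guozhang suggests, using the plain 0.9 producer API (the producer, serializers, and record construction are assumed to match the asker's existing test setup):

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

// Blocking on every send defeats batching:
//   producer.send(record).get();
// Asynchronous send lets linger.ms / batch.size accumulate records into one request:
producer.send(record, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception != null) {
            // handle or log the failure; producer-level retries are governed by the 'retries' config
            exception.printStackTrace();
        }
        // on success, metadata.offset() and metadata.partition() are available here
    }
});

Removing the blocking get() per message is what accounts for the roughly 1000x speedup Gary reports above.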