Kafka transaction: Receiving CONCURRENT_TRANSACTIONS on AddPartitionsToTxnRequest - apache-kafka

I am trying to publish one message to each of 16 Kafka partitions, spread over 7 brokers, within a single transaction.
The flow is like this:
open transaction
write a message to 16 partitions
commit transaction
sleep 25 ms
repeat
The transaction takes 50 ms on average, but sometimes it takes over 1 second to complete.
After enabling trace logging on the producer side, I noticed the following error:
TRACE internals.TransactionManager [kafka-producer-network-thread | producer-1] - [Producer clientId=producer-1, transactionalId=cma-2]
Received transactional response AddPartitionsToTxnResponse(errors={modelapp-ecb-0=CONCURRENT_TRANSACTIONS, modelapp-ecb-9=CONCURRENT_TRANSACTIONS, modelapp-ecb-10=CONCURRENT_TRANSACTIONS, modelapp-ecb-11=CONCURRENT_TRANSACTIONS, modelapp-ecb-12=CONCURRENT_TRANSACTIONS, modelapp-ecb-13=CONCURRENT_TRANSACTIONS, modelapp-ecb-14=CONCURRENT_TRANSACTIONS, modelapp-ecb-15=CONCURRENT_TRANSACTIONS, modelapp-ecb-1=CONCURRENT_TRANSACTIONS, modelapp-ecb-2=CONCURRENT_TRANSACTIONS, modelapp-ecb-3=CONCURRENT_TRANSACTIONS, modelapp-ecb-4=CONCURRENT_TRANSACTIONS, modelapp-ecb-5=CONCURRENT_TRANSACTIONS, modelapp-ecb-6=CONCURRENT_TRANSACTIONS, modelapp-ecb-7=CONCURRENT_TRANSACTIONS, modelapp-ecb-8=CONCURRENT_TRANSACTIONS}, throttleTimeMs=0)
for request (type=AddPartitionsToTxnRequest, transactionalId=cma-2, producerId=59003, producerEpoch=0, partitions=[modelapp-ecb-0, modelapp-ecb-9, modelapp-ecb-10, modelapp-ecb-11, modelapp-ecb-12, modelapp-ecb-13, modelapp-ecb-14, modelapp-ecb-15, modelapp-ecb-1, modelapp-ecb-2, modelapp-ecb-3, modelapp-ecb-4, modelapp-ecb-5, modelapp-ecb-6, modelapp-ecb-7, modelapp-ecb-8])
The Kafka producer retries sending AddPartitionsToTxnRequest(s) several times until it succeeds, but this leads to delays.
The code looks like this:
Properties producerProperties = PropertiesUtil.readPropertyFile(_producerPropertiesFile);
_producer = new KafkaProducer<>(producerProperties);
_producer.initTransactions();
_producerService = Executors.newSingleThreadExecutor(new NamedThreadFactory(getClass().getSimpleName()));
_producerService.submit(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            _producer.beginTransaction();
            for (int partition = 0; partition < _numberOfPartitions; partition++)
                _producer.send(new ProducerRecord<>(_producerTopic, partition, KafkaRecordKeyFormatter.formatControlMessageKey(_messageNumber, token), EMPTY_BYTE_ARRAY));
            _producer.commitTransaction();
            _messageNumber++;
            Thread.sleep(_timeBetweenProducedMessagesInMillis);
        } catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException | UnsupportedVersionException e) {
            closeProducer();
            break;
        } catch (KafkaException e) {
            _producer.abortTransaction();
        } catch (InterruptedException e) {...}
    }
});
Looking at the broker code, there seem to be two cases in which this error is returned, but I cannot tell why I end up in either of them:
object TransactionCoordinator {
  ...
  def handleAddPartitionsToTransaction(...): Unit = {
    ...
    if (txnMetadata.pendingTransitionInProgress) {
      // return a retriable exception to let the client backoff and retry
      Left(Errors.CONCURRENT_TRANSACTIONS)
    } else if (txnMetadata.state == PrepareCommit || txnMetadata.state == PrepareAbort) {
      Left(Errors.CONCURRENT_TRANSACTIONS)
    }
    ...
  }
  ...
}
Thanks in advance for help!
Later edit:
After enabling trace logging on the broker, we saw that the broker sends the END_TXN response to the producer before the transaction reaches the CompleteCommit state. The producer can therefore start a new transaction, which the broker rejects while it is still in the PrepareCommit -> CompleteCommit transition.
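The race described in the edit above can be illustrated with a toy model. The sketch below is a simplified, hypothetical state machine: the class, enum, and method names are invented for illustration and are not Kafka's broker code. It only mirrors the quoted check that returns CONCURRENT_TRANSACTIONS while the state is PrepareCommit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: names are hypothetical and this is not Kafka's broker code.
class ConcurrentTxnSketch {
    enum State { EMPTY, ONGOING, PREPARE_COMMIT, COMPLETE_COMMIT }
    enum Result { NONE, CONCURRENT_TRANSACTIONS }

    static class ToyCoordinator {
        State state = State.EMPTY;

        // Mirrors the quoted broker check: reject while a commit is still being finalized.
        Result addPartitions() {
            if (state == State.PREPARE_COMMIT) {
                return Result.CONCURRENT_TRANSACTIONS;
            }
            state = State.ONGOING;
            return Result.NONE;
        }

        // Per the trace described above, the END_TXN response goes out at this point.
        void endTxn() {
            state = State.PREPARE_COMMIT;
        }

        // The transition to CompleteCommit happens some time later.
        void completeCommit() {
            state = State.COMPLETE_COMMIT;
        }
    }

    static List<Result> run() {
        ToyCoordinator coordinator = new ToyCoordinator();
        List<Result> results = new ArrayList<>();
        results.add(coordinator.addPartitions()); // first transaction: accepted
        coordinator.endTxn();                     // producer believes the commit is done
        results.add(coordinator.addPartitions()); // next transaction starts too early: rejected
        coordinator.completeCommit();
        results.add(coordinator.addPartitions()); // retry after completion: accepted
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [NONE, CONCURRENT_TRANSACTIONS, NONE]
    }
}
```

In this model the rejected call corresponds to the retried AddPartitionsToTxnRequest seen in the producer trace; the delay is the time the retry has to wait for the second transition.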

Related

How to catch/Capture "javax.net.ssl.SSLHandshakeException: Failed to create SSL connection" while sending message over java vert.x Eventbus

I am trying to use SSL over the event bus. To test the failure case, I tried sending a message to the event bus from another verticle in the same cluster, configured with a different keystore.
I get the exception below on the console, but it does not fail the reply handler, so my code cannot detect the SSL exception.
My code:
eb.request("ping-address", "ping!", new DeliveryOptions(), reply -> {
    try {
        if (reply.succeeded()) {
            System.out.println("Received reply " + reply.result().body());
        } else {
            System.out.println("An exception " + reply.cause().getMessage());
        }
    } catch (Exception e) {
        System.out.println("An error occurred " + e.getCause());
    }
});
Exception on console:
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection
at io.vertx.core.net.impl.ChannelProvider$1.userEventTriggered(ChannelProvider.java:109)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319)
at io.netty.handler.ssl.SslHandler.handleUnwrapThrowable(SslHandler.java:1249)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1230)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:813)
Caused by: javax.net.ssl.SSLException: Received fatal alert: bad_certificate
at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1647)
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1615)
at sun.security.ssl.SSLEngineImpl.recvAlert(SSLEngineImpl.java:1781)
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1070)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:896)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:766)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1329)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224)
... 20 more
However, the handler only fails with a timeout after 30 seconds:
Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.8419a431-d633-4ba8-a12e-c41fd5a4f37a, repliedAddress: ping-address
I want to capture the SSL exception immediately and handle it. How can I catch this exception?
I also tried the code below. It does handle the exception, but I no longer get the reply from the called event-bus consumer: the reply result (value) is always null.
MessageProducer<Object> ms = eb.sender("ping-address");
ms.write("ping!", reply -> {
    if (reply.succeeded()) {
        reply.map(value -> {
            System.out.println("Received reply " + value);
            return reply;
        });
    } else {
        System.out.println("No reply");
        System.out.println("An exception : " + reply.cause().getMessage());
    }
});
You can't catch this exception, because the clustered Vert.x EventBus implementation buffers messages while the nodes are not connected to each other. The message may still be sent later if the problem is only temporary.
If you want to be notified earlier, you can set a lower send timeout in DeliveryOptions.

Producer lost some message on kafka restart

Kafka Client : 0.11.0.0-cp1
Kafka Broker :
During a rolling restart of the Kafka brokers, our application lost some messages while sending to the brokers. I believe a rolling restart should not cause any message loss. These are the producer settings we are using (the producer calls send() asynchronously and does not use the callback or the returned future):
val acksConfig: String = "all",
val retriesConfig: Int = Int.MAX_VALUE,
val retriesBackOffConfig: Int = 1000,
val batchSize: Int = 32768,
val lingerTime: Int = 1,
val maxBlockTime: Int = Int.MAX_VALUE,
val requestTimeOut: Int = 420000,
val bufferMemory: Int = 33_554_432,
val compressionType: String = "gzip",
val keySerializer: Class<StringSerializer> = StringSerializer::class.java,
val valueSerializer: Class<ByteArraySerializer> = ByteArraySerializer::class.java
I am seeing these warnings in the logs:
2019-03-19 17:30:59,224 [org.apache.kafka.clients.producer.internals.Sender] [kafka-producer-network-thread | producer-1] (Sender.java:511) WARN org.apache.kafka.clients.producer.internals.Sender - Got error produce response with correlation id 1105790 on topic-partition catapult_on_entitlement_updates_prod-67, retrying (2147483643 attempts left). Error: NOT_LEADER_FOR_PARTITION
The log says there are retry attempts left, so why didn't it retry? Does anyone have an idea?
Two things to note:
What is the replication factor of the topic you are producing to, and what is min.insync.replicas set to?
What do you mean by "producer lost some messages"? If the producer cannot successfully produce to min.insync.replicas brokers, it throws an exception and fails (for synchronous production). It is up to the producer/client to retry in case of failure (synchronous or asynchronous production).
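The point about asynchronous production can be sketched without Kafka. In the plain-JDK example below, sendAsync is a hypothetical stand-in for an asynchronous send whose result arrives later; it is not the Kafka API. The only thing it shows is that a failure becomes visible only if the caller inspects the result of each send. With the real producer, the equivalent hook is the Callback argument of send(record, callback) or the Future that send() returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK sketch: sendAsync is a hypothetical stand-in, not the Kafka producer API.
class ObserveSendResult {

    // Pretend send: odd ids fail, standing in for a send that ultimately could not be delivered.
    static CompletableFuture<Integer> sendAsync(int id) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        if (id % 2 == 1) {
            result.completeExceptionally(new IllegalStateException("send failed: " + id));
        } else {
            result.complete(id);
        }
        return result;
    }

    // Sends `messages` records and counts the failures reported through the completion hook.
    static int countFailures(int messages) {
        AtomicInteger failures = new AtomicInteger();
        for (int id = 0; id < messages; id++) {
            sendAsync(id).whenComplete((ok, error) -> {
                if (error != null) {
                    // This is where an application would log, re-queue, or alert.
                    failures.incrementAndGet();
                }
            });
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println(countFailures(4)); // prints 2
    }
}
```

Without the whenComplete hook the two failed sends in this example would go unnoticed, which is the fire-and-forget situation described in the question.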

Gracefully restart a Reactive-Kafka Consumer Stream on failure

Problem
When I restart, complete, or stop the stream, the old consumer does not shut down:
[INFO ] a.a.RepointableActorRef -
Message [akka.kafka.KafkaConsumerActor$Internal$Stop$]
from Actor[akka://ufo-sightings/deadLetters]
to Actor[akka://ufo-sightings/system/kafka-consumer-1#1896610594]
was not delivered. [1] dead letters encountered.
Description
I'm building a service that receives a message from Kafka topic and sends the message to an external service via HTTP request.
A connection with the external service can be broken, and my service needs to retry the request.
Additionally, if there is an error in the stream, the entire stream needs to restart.
Finally, sometimes I don't need the stream and its corresponding Kafka consumer, and I would like to shut down the entire stream.
So I have a stream:
Consumer.committableSource(customizedSettings, subscriptions)
  .flatMapConcat(sourceFunction)
  .toMat(Sink.ignore)
  .run
The HTTP request is sent in sourceFunction.
I followed the Kafka consumer restart instructions in the new documentation:
RestartSource.withBackoff(
  minBackoff = 20.seconds,
  maxBackoff = 5.minutes,
  randomFactor = 0.2) { () =>
  Consumer.committableSource(customizedSettings, subscriptions)
    .watchTermination() {
      case (consumerControl, streamComplete) =>
        logger.info(s" Started Watching Kafka consumer id = ${consumer.id} termination: is shutdown: ${consumerControl.isShutdown}, is f completed: ${streamComplete.isCompleted}")
        consumerControl.isShutdown.map(_ => logger.info(s"Shutdown of consumer finally happened id = ${consumer.id} at ${DateTime.now}"))
        streamComplete
          .flatMap { _ =>
            consumerControl.shutdown().map(_ => logger.info(s"3.consumer id = ${consumer.id} SHUTDOWN at ${DateTime.now} GRACEFULLY:CLOSED FROM UPSTREAM"))
          }
          .recoverWith {
            case _ =>
              consumerControl.shutdown().map(_ => logger.info(s"3.consumer id = ${consumer.id} SHUTDOWN at ${DateTime.now} ERROR:CLOSED FROM UPSTREAM"))
          }
    }
    .flatMapConcat(sourceFunction)
}
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(Sink.ignore)(Keep.left)
  .run
There is an open issue that discusses this non-terminating consumer in a complex Akka stream, but there is no solution yet.
Is there a workaround that forces the Kafka consumer to terminate?
How about wrapping the consumer in an Actor and registering a KillSwitch, see: https://doc.akka.io/docs/akka/2.5/stream/stream-dynamic.html#dynamic-stream-handling
Then in the Actor postStop method you can terminate the stream.
By wrapping the Actor in a BackoffSupervisor, you get the exponential backoff.
Example actor: https://github.com/tradecloud/kafka-akka-extension/blob/master/src/main/scala/nl/tradecloud/kafka/KafkaSubscriberActor.scala#L27

log.debug("Coordinator discovery failed for group {}, refreshing metadata", groupId) with kafka 0.11.0.x

I'm using the Kafka (version 0.11.0.2) server API to start a Kafka broker on localhost. It starts without any problem, and the producer can send messages successfully. But the consumer can't get any messages, and there is no error in the console log. So I debugged the code and found it looping on "refreshing metadata".
Here is the source code:
while (coordinatorUnknown()) {
    RequestFuture<Void> future = lookupCoordinator();
    client.poll(future, remainingMs);
    if (future.failed()) {
        if (future.isRetriable()) {
            remainingMs = timeoutMs - (time.milliseconds() - startTimeMs);
            if (remainingMs <= 0)
                break;
            log.debug("Coordinator discovery failed for group {}, refreshing metadata", groupId);
            client.awaitMetadataUpdate(remainingMs);
        } else
            throw future.exception();
    } else if (coordinator != null && client.connectionFailed(coordinator)) {
        // we found the coordinator, but the connection has failed, so mark
        // it dead and backoff before retrying discovery
        coordinatorDead();
        time.sleep(retryBackoffMs);
    }
    remainingMs = timeoutMs - (time.milliseconds() - startTimeMs);
    if (remainingMs <= 0)
        break;
}
Addition: when I change the Kafka version to 0.10.x, it runs OK.
Here is my Kafka server code.
private static void startKafkaLocal() throws Exception {
    final File kafkaTmpLogsDir = File.createTempFile("zk_kafka", "2");
    if (kafkaTmpLogsDir.delete() && kafkaTmpLogsDir.mkdir()) {
        Properties props = new Properties();
        props.setProperty("host.name", KafkaProperties.HOSTNAME);
        props.setProperty("port", String.valueOf(KafkaProperties.KAFKA_SERVER_PORT));
        props.setProperty("broker.id", String.valueOf(KafkaProperties.BROKER_ID));
        props.setProperty("zookeeper.connect", KafkaProperties.ZOOKEEPER_CONNECT);
        props.setProperty("log.dirs", kafkaTmpLogsDir.getAbsolutePath());
        //advertised.listeners=PLAINTEXT://xxx.xx.xx.xx:por
        // flush every message.
        // flush every 1ms
        props.setProperty("log.default.flush.scheduler.interval.ms", "1");
        props.setProperty("log.flush.interval", "1");
        props.setProperty("log.flush.interval.messages", "1");
        props.setProperty("replica.socket.timeout.ms", "1500");
        props.setProperty("auto.create.topics.enable", "true");
        props.setProperty("num.partitions", "1");
        KafkaConfig kafkaConfig = new KafkaConfig(props);
        KafkaServerStartable kafka = new KafkaServerStartable(kafkaConfig);
        kafka.startup();
        System.out.println("start kafka ok "+kafka.serverConfig().numPartitions());
    }
}
Thanks.
With kafka 0.11, if you set num.partitions to 1 you also need to set the following 3 settings:
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
That should be obvious from your server logs when running 0.11.
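In the embedded-broker setup from the question, these three settings could be added to the same Properties object before it is passed to KafkaConfig. The sketch below uses only java.util.Properties; the helper name withSingleBrokerOverrides is hypothetical, and the values assume a single local broker as in the question.

```java
import java.util.Properties;

// Sketch only: the helper name is hypothetical; the keys are the three settings listed above.
class SingleBrokerProps {

    // Returns a copy of the given broker properties with the three overrides applied.
    static Properties withSingleBrokerOverrides(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("offsets.topic.replication.factor", "1");
        props.setProperty("transaction.state.log.replication.factor", "1");
        props.setProperty("transaction.state.log.min.isr", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("num.partitions", "1"); // as in the question's startKafkaLocal()
        Properties props = withSingleBrokerOverrides(base);
        System.out.println(props.getProperty("offsets.topic.replication.factor")); // prints 1
    }
}
```

The same three lines could equally be written as props.setProperty(...) calls next to the existing ones in startKafkaLocal().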

How to shutdown a Kafka ConsumerConnector

I have a system that pulls messages from a Kafka topic, and when it's unable to process messages because some external resource is unavailable, it shuts down the consumer, returns the message to the topic, and waits some time before starting the consumer again. The only problem is, shutting down doesn't work. Here's what I see in my logs:
2014-09-30 08:24:10,918 - com.example.kafka.KafkaConsumer [info] - [application-akka.actor.workflow-context-8] Shutting down kafka consumer for topic new-problem-reports
2014-09-30 08:24:10,927 - clients.kafka.ProblemReportObserver [info] - [application-akka.actor.workflow-context-8] Consumer shutdown
2014-09-30 08:24:11,946 - clients.kafka.ProblemReportObserver [warn] - [application-akka.actor.workflow-context-8] Sending 7410-1412090624000 back to the queue
2014-09-30 08:24:12,021 - clients.kafka.ProblemReportObserver [debug] - [kafka-akka.actor.kafka-consumer-worker-context-9] Message from partition 0: key=7410-1412090624000, msg=7410-1412090624000
There's a few layers at work here, but the important code is:
In KafkaConsumer.scala:
protected def consumer: ConsumerConnector = Consumer.create(config.asKafkaConfig)
def shutdown() = {
  logger.info(s"Shutting down kafka consumer for topic ${config.topic}")
  consumer.shutdown()
}
In the routine that observes messages:
(processor ? ProblemReportRequest(problemReportKey)).map {
  case e: ConnectivityInterruption =>
    val backoff = 10.seconds
    logger.warn(s"Can't connect to essential services, pausing for $backoff", e)
    stop()
    // XXX: Shutdown isn't instantaneous, so returning has to happen after a delay.
    // Unfortunately, there's still a race condition here, plus there's a chance the
    // system will be shut down before the message has been returned.
    system.scheduler.scheduleOnce(100 millis) { returnMessage(message) }
    system.scheduler.scheduleOnce(backoff) { start() }
    false
  case e: Exception => returnMessage(message, e)
  case _ => true
}.recover { case e => returnMessage(message, e) }
And the stop method:
def stop() = {
  if (consumerRunning.get()) {
    consumer.shutdown()
    consumerRunning.compareAndSet(true, false)
    logger.info("Consumer shutdown")
  } else {
    logger.info("Consumer is already shutdown")
  }
  !consumerRunning.get()
}
Is this a bug, or am I doing it wrong?
Because your consumer is a def, every reference to it creates a new ConsumerConnector. Calling consumer.shutdown() therefore creates a fresh instance and shuts that one down, not the one that is actually consuming. Make consumer a val instead.
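The difference can be sketched in Java as an analogy: a Supplier that builds a new object on every call behaves like the Scala def, while a field assigned once behaves like a val. FakeConnector below is a hypothetical stand-in for ConsumerConnector, not the Kafka class.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Java analogy of Scala def vs val; FakeConnector is a hypothetical stand-in.
class DefVersusVal {

    static class FakeConnector {
        final AtomicBoolean running = new AtomicBoolean(true);

        void shutdown() {
            running.set(false);
        }
    }

    // Returns { running flag of the def-style instance in use, running flag of the val-style instance }.
    static boolean[] demo() {
        // Like `def consumer = Consumer.create(...)`: every access builds a new instance.
        Supplier<FakeConnector> asDef = FakeConnector::new;
        FakeConnector inUse = asDef.get(); // the instance that is actually consuming
        asDef.get().shutdown();            // shuts down a different, freshly created instance
        boolean defStillRunning = inUse.running.get();

        // Like `val consumer = Consumer.create(...)`: one instance, created once and reused.
        FakeConnector asVal = new FakeConnector();
        asVal.shutdown();
        boolean valStillRunning = asVal.running.get();

        return new boolean[] { defStillRunning, valStillRunning };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("def-style still running: " + result[0]); // prints true
        System.out.println("val-style still running: " + result[1]); // prints false
    }
}
```

This matches the log in the question, where messages keep arriving after "Consumer shutdown" is logged.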