Flink Kafka connector 0.11.0 - apache-kafka

I am trying to get the Flink Kafka connector 0.11 working, but it keeps throwing this error when running the job.
java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1260)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
As I understood from the Kafka documentation, the transaction timeout must be larger than the checkpoint interval but smaller than the broker's transaction.max.timeout.ms.
My cluster is set up as below:
Flink version 1.4.2
Application with flink-connector-kafka-0.11_2.11
Checkpoint interval : 5000ms
Observed End-to-end checkpoint time: 2s
Kafka producer config:
transactional.id : tx-kafka-topic1
transaction.timeout.ms : 30000
acks: all
enable.idempotence : true
retries: 3
max.in.flight.requests.per.connection : 1
Kafka broker (kafka_2.11-1.0.0-cp1.jar) with server config:
transaction.max.timeout.ms=120000
transaction.state.log.replication.factor=3
It seems to me these intervals do not conflict with each other, but the job still fails with the error above. I would appreciate it if someone could point me in the right direction.
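For illustration, here is a minimal sketch (not the actual job; the broker address, topic name, and dummy source are placeholders) of how a FlinkKafkaProducer011 exactly-once sink is typically wired together with these settings, with transaction.timeout.ms passed through the producer Properties and the checkpoint interval set to 5000 ms:

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ExactlyOnceSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint interval: must be comfortably shorter than transaction.timeout.ms below.
        env.enableCheckpointing(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Must not exceed the broker's transaction.max.timeout.ms (120000 in this setup)
        // and must be larger than the checkpoint interval plus the observed checkpoint duration.
        props.setProperty("transaction.timeout.ms", "30000");

        env.fromElements("event-1", "event-2", "event-3")        // placeholder source
           .addSink(new FlinkKafkaProducer011<>(
                   "kafka-topic1",                               // placeholder topic
                   new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                   props,
                   FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));

        env.execute("exactly-once-sketch");
    }
}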

Related

Flink job fails continuously after Kafka topic partition reassignment

Environment
Kafka 1.0.1
Flink 1.7.1
Trouble
I use a topic with 200 partitions, and Flink consumes from this topic.
Recently, I did a manual partition reassignment.
When I reassigned the partitions, Flink continuously failed with the errors below.
Error 1:
[2021-07-28 18:21:15,926] WARN Attempting to send response via channel for which there is no open connection, connection id ..(kafka.network.Processor)
Error 2:
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for -126: 30042 ms has passed since batch creation plus linger time
Error 3:
java.lang.Exception: Error while triggering checkpoint 656 for Source: Custom Source -> Sink: ... (32/200)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1174)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not perform checkpoint 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:570)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1163)
... 5 more
Caused by: java.lang.Exception: Could not complete snapshot 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for ...-86: 30049 ms has passed since batch creation plus linger time
And when I restarted the failed job, this error occurred continuously:
ClassLoader info: URL ClassLoader:
file: '/blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47' (invalid JAR: /blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47 (Too many open files))
Class not resolvable through given classloader.
So I restarted all the Mesos and Flink clusters and cleared ZooKeeper.
Is there any other cause I should look for?
There were network issues with certain brokers in the cluster.
If requests for a specific partition are processed slowly because of a network issue, messages like the ones above are expected.
Subsequently, the subtasks handling that partition stop working properly, which appears to trigger the Flink checkpoint failures.
The problem was solved by replacing the hardware of the affected broker.

Kafka: Producer threads get stuck

I have an Apache NiFi workflow that streams data into Kafka. My Kafka cluster is made up of 5 nodes and uses SSL for encryption.
When a lot of data is going through, my Kafka producer (PublishKafkaRecord) freezes and stops working. I have to restart the processor, and I am getting thread errors.
I am using Confluent Kafka 5.3.1.
I am seeing these errors in the Kafka logs:
ERROR Uncaught exception in scheduled task 'transactionalID-expiration' (kafka.utils.KafkaScheduler)
Retrying leaderEpoch request for partitions XXX-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
Could not find offset index file corresponding to log file XXX/*.log recovering segment and rebuilding index files (kafka.log.Log)
ERROR when handling request: .... __transaction_state
ERROR TransactionMetadata (... ) failed: this should not happen (kafka.coordinator.transaction.TransactionMetadata)
I cannot pinpoint the actual error.
How can I fix the threads getting stuck in Kafka?

What is the right configuration for Kafka with long-running message processing and horizontal scaling

My application consumes messages whose processing takes 5-15 minutes on average per message, with max.poll.records: 1.
When there are no messages there is 1 pod (K8s), and for each incoming message I scale up by adding 1 pod, up to a maximum of 50 (I have 50 partitions).
Now I have a couple of issues:
When a new pod comes up, it takes a long time before I see a partition assigned to it. I can see pods that start and then terminate after a couple of minutes without ever taking a single message.
When a lot of messages are inserted into the topic (more than 10, which is a lot for this application), I start to get a CommitFailedException and therefore consume the same message twice on different pods.
Error:
java.lang.IllegalStateException: This error handler cannot process 'org.apache.kafka.clients.consumer.CommitFailedException's; no record information is available
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:151) ~[spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.SeekToCurrentErrorHandler.handle(SeekToCurrentErrorHandler.java:113) ~[spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.handleConsumerException(KafkaMessageListenerContainer.java:1361) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1063) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_252]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:1109) ~[kafka-clients-2.5.0.jar!/:na]
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:976) ~[kafka-clients-2.5.0.jar!/:na]
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1511) ~[kafka-clients-2.5.0.jar!/:na]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doCommitSync(KafkaMessageListenerContainer.java:2311) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitSync(KafkaMessageListenerContainer.java:2306) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.commitIfNecessary(KafkaMessageListenerContainer.java:2292) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.processCommits(KafkaMessageListenerContainer.java:2106) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1097) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1031) [spring-kafka-2.5.6.RELEASE.jar!/:2.5.6.RELEASE]
... 3 common frames omitted
My Kafka config is:
spring.kafka:
  bootstrap-servers: {{ .Values.kafka.service }}:9092
  consumer:
    client-id: XXX
    group-id: YYY
    auto-offset-reset: earliest
    enable-auto-commit: false
    fetch-min-size: 1
    fetch-max-wait: 1
    properties:
      heartbeat.interval.ms: 3000
      max.poll.interval.ms: 1200000
      max.poll.records: 1
  producer:
    client-id: XXX
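For comparison, here is a minimal plain-Java sketch (not the Spring listener itself; the broker address, topic, and deserializers are placeholders) showing the same poll/commit timing constraints on a raw KafkaConsumer: each poll returns at most one record, and that record must be processed and committed before max.poll.interval.ms elapses, otherwise the consumer is removed from the group and the commit fails just like in the stack trace above.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SlowProcessingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "YYY");                   // placeholder group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");
        // Must be longer than the worst-case processing time of one record (15 min here),
        // otherwise the group coordinator considers the consumer dead and rebalances.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "1200000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));      // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);   // 5-15 minutes of work
                }
                // Commit before the next poll(); if processing exceeded max.poll.interval.ms,
                // this is where the CommitFailedException from the stack trace shows up.
                if (!records.isEmpty()) {
                    consumer.commitSync();
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the long-running business logic
    }
}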

How to fix the Java Kafka producer error "Received invalid metadata error in produce request on partition" and OutOfMemoryError when a broker is down

I have been creating a Kafka producer example using Java. I have been sending simple data, just "Test" + an integer, as the value to Kafka. I used the properties below; after I started the producer client and messages were on the way, I killed the broker and suddenly received the error message below instead of the producer retrying.
I am using 3 brokers and a topic with 3 partitions, a replication factor of 3, and no min.insync.replicas.
Below are the properties configured:
config.put(ProducerConfig.ACKS_CONFIG, "all");
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
config.put(CommonClientConfigs.RETRIES_CONFIG, 60);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG ,10000);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG ,30000);
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG ,10000);
config.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG , 1048576);
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
config.put(ProducerConfig.LINGER_MS_CONFIG, 0);
config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1073741824); // 1GB
The result when I kill all my brokers, or sometimes just one of them, is as below.
Error:
WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Got error produce response with correlation id 124 on topic-partition testing001-0, retrying (59 attempts left). Error: NETWORK_EXCEPTION
27791 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Received invalid metadata error in produce request on partition testing001-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
28748 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.common.utils.KafkaThread - Uncaught exception in thread 'kafka-producer-network-thread | producer-1':
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(Unknown Source)
at java.nio.ByteBuffer.allocate(Unknown Source)
at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:560)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:496)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Unknown Source)
I assume you are testing the producer. When a producer connects to the Kafka cluster, you pass all broker IPs and ports as a comma-separated string; in your case there are three brokers. When the producer tries to connect to the cluster, the cluster controller responds with the cluster metadata as part of initialization. Assume your producer only publishes messages to a single topic. The cluster maintains a leader among the brokers for every topic partition. After identifying the leader for the topic, your producer only communicates with that leader while it is alive.
In your testing scenario, you are deliberately killing the broker instances. When that happens, the Kafka cluster needs to elect a new leader for your topic, and the controller has to pass the new metadata to your producer. If the metadata changes quite frequently (in your case you may kill another broker in the meantime), the producer may receive invalid metadata.
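A minimal sketch of that setup (broker addresses and topic name are placeholders, and the retry settings simply mirror the question's values): list every broker in bootstrap.servers so the producer can refresh metadata from a surviving broker after a leader change, and use a send callback to see which records still fail once retries are exhausted.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerFailoverSketch {
    public static void main(String[] args) {
        Properties config = new Properties();
        // All three brokers, so metadata can still be refreshed when one goes down.
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                   "broker1:9092,broker2:9092,broker3:9092");   // placeholders
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                   "org.apache.kafka.common.serialization.StringSerializer");
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                   "org.apache.kafka.common.serialization.StringSerializer");
        config.put(ProducerConfig.ACKS_CONFIG, "all");
        config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        config.put(ProducerConfig.RETRIES_CONFIG, 60);
        config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(config)) {
            for (int i = 0; i < 100; i++) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("testing001", "Test" + i);
                // The callback reports records that could not be delivered even after
                // retries (e.g. NetworkException while partition leadership moves).
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("Send failed: " + exception);
                    }
                });
            }
            producer.flush();
        }
    }
}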

Apache Kafka: lots of WARN logs showing Unexpected error

I was running a Kafka cluster with 2 brokers, but I keep getting the WARN log below.
I checked all my systems and there was no host using IP 10.8.7.1.
By the way, there were more IPs that look like they belong to ZooKeeper or the brokers.
If I shut down one of the Kafka brokers, the WARNING logs become less frequent.
I am not familiar with Kafka and ZooKeeper; I am just getting started and studying.
Any ideas?
Kafka version: 1.0.1
The WARN log is similar to the one below (I get this kind of log about every 10 seconds):
[2018-04-19 09:13:08,342] WARN [SocketServer brokerId=0] Unexpected error from /10.8.7.1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295616 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:545)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
at kafka.network.Processor.poll(SocketServer.scala:551)
at kafka.network.Processor.run(SocketServer.scala:468)
at java.lang.Thread.run(Thread.java:748)
One possible cause is that a Kafka producer on 10.8.7.1 is attempting to send about 0.369 GB of data in a single request, which exceeds the 104857600-byte (100 MB) limit the broker enforces on a single receive, instead of streaming it in smaller batches. You may have to track down that Kafka producer and see what is going on.
Hope this helps.
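If it does turn out to be an oversized producer request, here is a sketch of the producer-side knob involved (the topic, broker address, and values are illustrative, not a recommendation): max.request.size caps how large a single produce request can be, and it has to stay below the broker's socket.request.max.bytes, which is the 104857600-byte limit reported in the log above.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RequestSizeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Keep the largest single request well below the broker's
        // socket.request.max.bytes (default 104857600 = 100 MB).
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10485760);         // 10 MB, illustrative
        // Batch size also bounds how much is grouped per partition before sending.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] payload = new byte[1024];                                  // small illustrative record
            producer.send(new ProducerRecord<>("some-topic", payload));      // placeholder topic
            producer.flush();
        }
    }
}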