pubSubSource: Receiving the same message twice - apache-kafka

Description
I have a pubSubSource connector running in Kafka Connect distributed mode that simply reads from a Pub/Sub subscription and writes into a Kafka topic. The issue is that even though I publish only one message to GCP Pub/Sub, the message is written to my Kafka topic twice.
How to reproduce
Deploy Kafka and Kafka Connect
Create a connector with the pubSubSource configuration below:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "pubSubSource",
  "config": {
    "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "tasks.max": "1",
    "cps.subscription": "pubsub-test-sub",
    "kafka.topic": "kafka-sub-topic",
    "cps.project": "test-project123",
    "gcp.credentials.file.path": "/tmp/gcp-creds/account-key.json"
  }
}'
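(Optional, not part of the original steps: you can confirm the connector was created and that only a single task is running via the Connect REST API; a second connector instance consuming the same subscription and writing to the same topic would also produce duplicates.)
curl http://localhost:8083/connectors/pubSubSource/status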
Below are the Kafka Connect worker configurations:
"plugin.path": "/usr/share/java,/usr/share/confluent-hub-components"
"key.converter": "org.apache.kafka.connect.json.JsonConverter"
"value.converter": "org.apache.kafka.connect.json.JsonConverter"
"key.converter.schemas.enable": "false"
"value.converter.schemas.enable": "false"
"internal.key.converter": "org.apache.kafka.connect.json.JsonConverter"
"internal.value.converter": "org.apache.kafka.connect.json.JsonConverter"
"config.storage.replication.factor": "1"
"offset.storage.replication.factor": "1"
"status.storage.replication.factor": "1"
Publish a message to the Pub/Sub topic using the command below:
gcloud pubsub topics publish test-topic --message='{"someKey":"someValue"}'
Read messages from the destination Kafka topic:
/usr/bin/kafka-console-consumer --bootstrap-server xx.xxx.xxx.xx:9092 --topic kafka-sub-topic --from-beginning
# Output
{"someKey":"someValue"}
{"someKey":"someValue"}
Why is this happening? Is there something that I am doing wrong?

I found the info below at https://cloud.google.com/pubsub/docs/faq, and it seems you are facing the same issue. Could you try producing a larger message and see if the result is the same?
Details from the link:
Why are there too many duplicate messages?
Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Pub/Sub is retrying the message delivery. This can be observed in the monitoring metrics pubsub.googleapis.com/subscription/pull_ack_message_operation_count for pull subscriptions, and pubsub.googleapis.com/subscription/push_request_count for push subscriptions. Look for elevated expired or webhook_timeout values in the /response_code. This is particularly likely if there are many small messages, since Pub/Sub may batch messages internally and a partially acknowledged batch will be fully redelivered.
Another possibility is that the subscriber is not acknowledging some messages because the code path processing those specific messages fails, and the Acknowledge call is never made; or the push endpoint never responds or responds with an error.
How do I detect duplicate messages?
Pub/Sub assigns a unique message_id to each message, which can be used to detect duplicate messages received by the subscriber. This will not, however, allow you to detect duplicates resulting from multiple publish requests on the same data. Detecting those will require a unique message identifier to be provided by the publisher. See Pub/Sub I/O for further discussion.

Sometimes, due to acknowledgement delays, Google Pub/Sub keeps retrying delivery, and the duplicates end up on the Kafka broker. This can usually be avoided by increasing the acknowledgement deadline on the Google Pub/Sub subscription. This ensures your messages get enough time to be acknowledged, and the duplicates issue is resolved.
For more details, you can check out this link from the Confluent documentation:
https://docs.confluent.io/kafka-connect-gcp-pubsub/current/overview.html#too-many-duplicates
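As a rough sketch, assuming the subscription from the question above, the acknowledgement deadline can be raised with gcloud (value in seconds, 600 being the maximum):
gcloud pubsub subscriptions update pubsub-test-sub --ack-deadline=600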

Related

Kafka Connect Skipping Messages due to Confluent Interceptor

I am seeing the following message in my Connect log:
WARN Monitoring Interceptor skipped 2294 messages with missing or invalid timestamps for topic TEST_TOPIC_1. The messages were either corrupted or using an older message format. Please verify that all your producers support timestamped messages and that your brokers and topics are all configured with log.message.format.version, and message.format.version >= 0.10.0 respectively. You may also experience this if you are consuming older messages produced to Kafka prior to any of those changes taking place. (io.confluent.monitoring.clients.interceptor.MonitoringInterceptor)
I have changed my Kafka broker configuration with this:
KAFKA_INTER_BROKER_PROTOCOL_VERSION: 0.11.0
KAFKA_LOG_MESSAGE_FORMAT_VERSION: 0.11.0
I am guessing this is reducing my overall producer throughput while I am doing load testing.
PS:
I don't want to remove the Confluent interceptor because it helps me monitor throughput and consumer lag.
CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor"
CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor"
Is there any way to not skip those messages? I am using pepperbox to produce messages and it doesn't set a timestamp:
{
  "messageId": {{SEQUENCE("messageId", 1, 1)}},
  "messageBody": "{{RANDOM_ALPHA_NUMERIC("abcedefghijklmnopqrwxyzABCDEFGHIJKLMNOPQRWXYZ", 2700)}}",
  "messageCategory": "{{RANDOM_STRING("Finance", "Insurance", "Healthcare", "Shares")}}",
  "messageStatus": "{{RANDOM_STRING("Accepted","Pending","Processing","Rejected")}}"
}
Thanks in advance!
Look at the Kafka version in the pepperbox pom, and you'll see it's using Kafka 0.9.
Timestamps were added to Kafka records as of 0.10.0.
As the error says, "Please verify that all your producers support timestamped messages."
Recompile the project against a newer Kafka client version, and all produced records will automatically carry a timestamp and therefore will not be skipped.
Or, use a different tool like JMeter or Kafka Connect Datagen.
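If you only need a load generator whose client is recent enough to set record timestamps, one possible alternative (a sketch; the topic name and record size are taken from the question, the bootstrap server is a placeholder) is the kafka-producer-perf-test tool that ships with Kafka:
kafka-producer-perf-test --topic TEST_TOPIC_1 --num-records 100000 --record-size 2700 --throughput -1 --producer-props bootstrap.servers=localhost:9092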

Kafka: "Broker failed to validate record" after increasing partition

I had increased the number of partitions of an existing Kafka topic via Terraform. The partition count increased successfully; however, when I test the connection to the topic, I'm getting a "Broker failed to validate record" error.
Testing method:
echo "test" | kcat -b ...
**sensitive content has been removed**
...
% Auto-selecting Producer mode (use -P or -C to override)
% Delivery failed for message: Broker: Broker failed to validate record
I tried searching online and came across something called broker-side schema validation: https://docs.confluent.io/cloud/current/sr/broker-side-schema-validation.html
Is there something I need to do after increasing the partitions, i.e. flush some cache?
You need to ask your Kafka cluster administrator if they have schema validation enabled, but increasing partitions shouldn't cause that. (This is a feature of Confluent Server, not Apache Kafka).
If someone changed the schema in the schema registry for your topic, or validation has suddenly been enabled, and you are sending a record with an "old" schema (or an incorrect schema), then the broker would "fail to validate" the record.
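If you can run the standard admin tools against the cluster, a quick way to check (a sketch, assuming Confluent Server, where broker-side schema validation is controlled by the confluent.key.schema.validation and confluent.value.schema.validation topic configs; broker and topic names are placeholders):
# show the topic configs and look for confluent.*.schema.validation=true
kafka-configs --bootstrap-server broker:9092 --entity-type topics --entity-name your-topic --describe
On Confluent Cloud the same setting is managed through the Cloud console or CLI instead.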

Does errors.deadletterqueue.topic.name work for source connector

Does "errors.deadletterqueue.topic.name" work for source connector? I tested with JDBC sink connector and it works, but I don't find a record which has serialization error goes to dead letter queue.
I use Debezium Connector for MongoDB and apache-kafka-connect version is 2.4.0.
The rest error handling config:
"errors.tolerance": "all",
"errors.log.enable": "false",
"errors.deadletterqueue.topic.name": "test-dlq",
"errors.deadletterqueue.context.headers.enable": "true"
Apache Kafka Connect has included error handling options, including the functionality to route messages to a dead letter queue, since Apache Kafka 2.0 through KIP-298: Error Handling in Connect. According to this KIP, the dead letter queue is supported for sink connectors only.
Also you can check Kafka Connect docs:
errors.deadletterqueue.topic.name: The name of the topic to be used as the dead letter queue (DLQ) for messages that result in an error when processed by this sink connector, or its transformations or converters. The topic name is blank by default, which means that no messages are to be recorded in the DLQ.
There is a great article about error handling and dead letter queues by Robin Moffatt.
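Since the DLQ only applies to sink connectors, a possible workaround for a source connector (a sketch using the standard KIP-298 options, nothing Debezium-specific) is to keep errors.tolerance set to all but log the failing records to the Connect worker log instead:
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true"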

Kafka message not getting consumed

I have a single Kafka node (v 0.10.2) running. We have configured log.retention.hours=480, but messages are not available to the consumers before the expiry time of the messages. (For example, a message is pushed on 1st July and I start my consumer on 4th July. Before starting the consumer I verified the messages were there through the Yahoo Kafka monitoring service, but on starting the consumer it just keeps polling and waiting for messages.)
Below is the broker configuration:
broker.id=1
delete.topic.enable=true
num.network.threads=5
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=xxx
num.partitions=1
num.recovery.threads.per.data.dir=5
log.retention.hours=480
offsets.retention.minutes=43200
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=x.x.x.x:2181
zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000
I have googled it but was not able to find the reason. Please let me know why this is happening and how to fix it.
There are several reasons why a message might not be consumed. Here are some ways you can debug:
One is, as mentioned, that you might be consuming from the latest offset and hence waiting for new messages. In this case, try producing messages to the topic and check whether they are consumed.
Next, ensure that your consumer subscribes to the partitions to which the messages are produced. If you are using Streams API then you need not worry about this since you will subscribe to all the partitions.
There could be some consumer configuration problem. Try consuming using kafka-console-consumer with the --from-beginning flag (see the command sketch after this list).
If there is no output means that the messages are probably not there or you have some difficulty connecting to it.
You can repeat this test both on the Kafka machine and from outside.
Check your connection from your consumer client to your broker which is the leader of the topic partitions you are consuming from. If there is a problem connecting to it, you must be getting an exception (like timeout for fetching the data).
For this, you can use telnet broker_ip:port
Sometimes, it may happen that your bootstrap server is accessible, but not the other brokers that lead the topic partitions you are consuming from.
Check your iptables rules to see whether the broker port is blocked. See also:
What happens if the leader is not dead but unable to receive messages in Kafka? SPoF?
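A rough sketch of the checks above (topic name, broker address and port are placeholders):
# consume from the beginning to rule out an auto.offset.reset=latest problem
kafka-console-consumer --bootstrap-server broker_ip:9092 --topic your-topic --from-beginning
# check basic connectivity to the broker that leads the partition
telnet broker_ip 9092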

Messages sent to Kafka REST-Proxy being rejected by "This server is not the leader for that topic-partition" error

We have been facing some trouble, and a difference of understanding between the development team and the environment support team, regarding the Kafka REST Proxy from the Confluent Platform.
First of all, we have an environment of 5 Kafka brokers, with 64 partitions and a replication factor of 3.
It happens that our calls to rest-proxy are all using the following structure for now:
curl -X POST \
http://somehost:8082/topics/test \
-H 'content-type: application/vnd.kafka.avro.v1+json' \
-d '{
  "value_schema_id": 1,
  "records": [
    { "foo": "bar" }
  ]
}'
This kind of call works for 98.4% of the calls, and I noticed that when I make this call over 2k times, we don't receive any OK response for partition 62 (exactly 1.6% of the partitions).
This error rate used to be 10.9%, when we had 7 partitions returning errors, right before the support team recycled the Schema Registry.
Now, when the call goes to the partition 62, we receive the following answer:
{
  "offsets": [
    {
      "partition": null,
      "offset": null,
      "error_code": 50003,
      "error": "This server is not the leader for that topic-partition."
    }
  ],
  "key_schema_id": null,
  "value_schema_id": 1
}
The error is the same when I try to send the messages to the specific partition adding "/partitions/62" to the URL.
Support says REST Proxy is not smart enough ("it's just a proxy", they say) to select a valid partition and post to the leader broker of that partition.
They said it randomly selects the partition and then randomly selects the broker to post to (which could lead it to post to replicas or even to brokers that don't have the partition).
They recommended that we change our calls to fetch the topic metadata before posting the messages, then specify the partition and broker, and handle the round-robin assignment on the application side, which doesn't make sense to me.
On the dev side, my understanding is that REST Proxy uses the Apache Kafka client to post the messages to the brokers and is thus smart enough to post to the leader broker for the given partition; it also handles the round-robin within the kafka-client lib when the partition is not specified.
It seems to me like an environment issue related to that partition and not to the calling app itself (as it works without problems in other environments with the same configuration).
To sum up, my questions are:
Am I correct when I say that rest-proxy is smart enough to handle the partition round-robin and posting to the leader?
Should the application be handling the logic in question 1? (I don't see the reason for using rest-proxy instead of kafka-client directly in this case)
Does it look like an environment/orchestration problem to you too?
I hope this was all clear enough for you to give me some help!
Thanks in advance!
I do not use REST Proxy, but this error likely indicates that a NotLeaderForPartitionException happens during the calls. This error indicates that the leader of the partition has changed but the producer still uses stale metadata. This happened to me when the replication between brokers failed due to an internal error in the Kafka server. This can be checked in the server logs.
In our case I checked the topic with ./kafka-topics.sh --describe --zookeeper zookeeper_ip:2181 --topic test and it showed that the replicas on one of the brokers were not in sync (Isr column). Restarting that broker helped: the replicas became synchronised and the error disappeared.
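For reference, a sketch of that check (the ZooKeeper address and topic are placeholders; newer clusters would use --bootstrap-server instead of --zookeeper):
# compare the Replicas and Isr columns for each partition
./kafka-topics.sh --describe --zookeeper zookeeper_ip:2181 --topic test
# or list only the partitions whose ISR is smaller than the replica set
./kafka-topics.sh --describe --zookeeper zookeeper_ip:2181 --under-replicated-partitions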