Need some clarification regarding error handling in Kafka Connect

I'm looking over KIP-298: Error Handling in Connect.
In example 2, what will the following configuration do? A bit more information or an example would help me understand:
# retry for at most 10 minutes, waiting up to 30 seconds between consecutive failures
errors.retry.timeout=600000
errors.retry.delay.max.ms=30000
One more thing: while dealing with a sink connector, when I get errors due to duplicate records it keeps retrying for a certain period. How do I set my own limit on the number of retries?
I tried setting errors.retry.timeout=0, but the duplicate-key error still kept retrying a certain number of times; however, if the error is caused by the schema or the serializer, it doesn't retry.
And finally, when errors.log.enable is true, where are these logs stored? I was checking the Connect log, but I couldn't find any difference between the default log and the log produced when errors.log.enable is set to true.

Not sure how to fix your problem, but when errors.log.enable=true you should see two additional topics created for your connector, yourconnector-error and yourconnector-success; you should be able to see the connector failure messages in the yourconnector-error topic.
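For reference, here is a rough sketch of the KIP-298 error-handling properties as they might appear in a sink connector's configuration (the dead letter queue topic name is a made-up placeholder). errors.retry.timeout bounds the total time a failed operation is retried, errors.retry.delay.max.ms caps the delay between consecutive attempts, and errors.log.enable writes failure details to the Connect worker's own log:
# keep retrying a failed operation for up to 10 minutes (0 = no retries, -1 = retry forever)
errors.retry.timeout=600000
# wait at most 30 seconds between consecutive retry attempts
errors.retry.delay.max.ms=30000
# skip records that fail in conversion or transformation instead of killing the task
errors.tolerance=all
# log details of each failed operation (and optionally the record itself) to the worker log
errors.log.enable=true
errors.log.include.messages=true
# optionally route failed records to a dead letter queue topic (sink connectors only)
errors.deadletterqueue.topic.name=my-connector-dlq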

Related

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an incomplete read of message header: context deadline exceeded error, which makes it retry the command, but the previous one indeed succeeded because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.
In general, even if you know the command went through to the server, if you can't read the response you can't assume anything about its success.
In some cases, though, all that matters is whether the server got the command; if that's your situation, read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens while writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases a mongo.CommandError is returned, with identical Code and Labels fields, which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.
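As a small illustration of that last point: in recent driver versions retryable writes are enabled by default for supported single-document operations, and they can also be turned on or off explicitly via the connection string (the host below is just a placeholder, and the server must be a replica set or sharded cluster for retryable writes to apply):
mongodb://localhost:27017/?retryWrites=true
With this enabled, the driver attaches the transaction number described above and retries a failed write at most once, so a write that already reached the server is not applied a second time.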
I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.
To add to icza's solution:
You can also get the plain context deadline exceeded error on its own, so check for that as well.
A check for a context cancellation or deadline would look something like this:
// the driver wraps context errors in its own types, so fall back to matching the error string
if strings.Contains(err.Error(), "context") &&
    (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
    // the command may or may not have reached the server; decide whether to retry here
    ...
}
My solution to the problem was, instead of first checking whether there was an error, to first check whether there was a result from the operation.
Example:
result, err := database.collection.InsertOne(ctx, item)
if result != nil {
    // the server returned a result, so the insert went through even if err is non-nil
    return result.InsertedID, err
}
return nil, err
If the operation did go through despite the error, you could add some compensation logic to undo it.

Producer metric: record_error_rate, when will this value be greater than zero?

I am checking and validating producer and consumer metrics, but the metric record_error_rate always seems to be zero in all my experiments.
If someone has any idea when this will be non-zero, please let me know, along with the steps to reproduce it.
record-error-rate indicates the average per-second number of record sends that resulted in errors for a topic.
If you get 0 then it simply means that no record has failed from the producer side.
Now if for testing purposes you want to intentionally get a non-zero value, then simply try to generate a faulty message from the producer side. For example, try to send a message with a wrong schema, without authentication (if SSL is enabled), etc.
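If it helps, here is a rough sketch (in Java) of one way to force a failed send so that record-error-rate becomes non-zero. It assumes a local broker at localhost:9092 with the default message.max.bytes of roughly 1 MB and an existing topic named test-topic; those are assumptions, not requirements of the metric. The record passes the client-side max.request.size check but is rejected by the broker, and the failed send is counted by the metric:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecordErrorRateDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // raise the client-side limit so the oversized record reaches the broker
        // instead of being rejected locally
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 5 * 1024 * 1024);
        props.put(ProducerConfig.RETRIES_CONFIG, 0);

        // larger than the broker's default message.max.bytes (about 1 MB), so the broker rejects it
        byte[] tooLarge = new byte[2 * 1024 * 1024];

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", tooLarge), (metadata, exception) -> {
                if (exception != null) {
                    // this failed send is what record-error-rate counts
                    System.out.println("Send failed: " + exception);
                }
            });
            producer.flush();
        }
    }
}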

Kafka Connect error handling and improved logging

I was trying to leverage some enhancements in Kafka connect in 2.0.0 release as specified by this KIP https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect and I came across this good blog post by Robin https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues.
Here are my questions
I have set errors.tolerance=all in my connector config. If I understand correctly, the connector will not fail on bad records and will move forward. Is my understanding correct?
In my case, the consumer doesn't fail and stays in the RUNNING state (which is expected), but the consumer offsets don't move forward for the partitions with the bad records. Any guess why this may be happening?
I have set errors.log.include.messages and errors.log.enable to true for my connector but I don't see any additional logging for the failed records. The logs are similar to what I used to see before enabling these properties. I didn't see any message like this https://github.com/apache/kafka/blob/5a95c2e1cd555d5f3ec148cc7c765d1bb7d716f9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/LogReporter.java#L67
Some Context:
In my connector, I do some transformations and validations for every record, and if any of these fail, I throw a RetriableException. Earlier I was throwing a RuntimeException, but I changed it to RetriableException after reading the comments for the RetryWithToleranceOperator class.
I have tried to keep it brief but let me know if any additional context is required.
Thanks so much in advance!
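For what it's worth, here is a minimal sketch of the pattern described in the context above (the task class, helper method, and the logic inside them are hypothetical, not taken from the actual connector). The comment on the catch block also notes how the Connect runtime treats a RetriableException thrown from put():

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.errors.RetriableException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MySinkTask extends SinkTask { // hypothetical task name

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> config) {
        // initialize the client for the target system here
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                transformValidateAndWrite(record); // hypothetical helper
            } catch (RuntimeException e) {
                // A RetriableException thrown from put() makes the runtime redeliver the same
                // batch later; it is handled by the task's own retry path, not by the
                // errors.tolerance / dead letter queue machinery, which in 2.0.0 covers the
                // converter and transformation stages.
                throw new RetriableException(e);
            }
        }
    }

    @Override
    public void stop() {
        // release resources
    }

    private void transformValidateAndWrite(SinkRecord record) {
        // placeholder for the per-record transformations and validations from the question
    }
}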

Kafka Streams Deserialization Handler

I am trying to use the LogAndContinueExceptionHandler on deserialization. It works fine when an error occurs, successfully logging it and continuing. However, let's say I have a continuous stream of errors on my incoming messages and I stop and restart the Kafka Streams application; I then see that the messages which failed and were already logged in my last attempt re-appear (they get logged again). It is more problematic if I try to send the failed messages to a DLQ: on a restart, they are sent to the DLQ again. As soon as a good record comes in, the offset moves forward and I don't see the already-logged messages again on another restart. Is there a way to manually commit within the Streams application? I tried to use ProcessorContext#commit(), but that doesn't seem to have any effect.
I reproduced this behavior by running the sample provided here: https://github.com/confluentinc/kafka-streams-examples/blob/4.0.0-post/src/main/java/io/confluent/examples/streams/WordCountLambdaExample.java
I changed the incoming value Serde to Serdes.Integer().getClass().getName() to force a deserialization error on input and reduced the commit interval to just 1 second. Also added the following to the config.
streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);
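Putting those changes together, the configuration block of the linked example roughly becomes the following (a sketch; the application id and bootstrap server are placeholders, and the surrounding class and imports of the original example are assumed):

streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// an Integer serde on String-encoded input is what forces the deserialization errors
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());
// log and skip records that fail deserialization instead of failing the application
streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);
// commit roughly once per second
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);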
Once it fails and I restart the app, the same records that failed before appear in the logs again. For example, I see the following output on the console each time I restart the app. I would expect these not to be tried again, as we already skipped them before.
2018-01-27 15:24:37,591 WARN wordcount-lambda-example-client-StreamThread-1 o.a.k.s.p.i.StreamThread:40 - Exception caught during Deserialization, taskId: 0_0, topic: words, partition: 0, offset: 113
org.apache.kafka.common.errors.SerializationException: Size of data received by IntegerDeserializer is not 4
2018-01-27 15:24:37,592 WARN wordcount-lambda-example-client-StreamThread-1 o.a.k.s.p.i.StreamThread:40 - Exception caught during Deserialization, taskId: 0_0, topic: words, partition: 0, offset: 114
org.apache.kafka.common.errors.SerializationException: Size of data received by IntegerDeserializer is not 4
Looks like when deserialization exceptions occur, this flag is never set to true here: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L228. It seems like it only becomes true once processing succeeds. That might be the reason why the commit is not happening even after I manually call ProcessorContext#commit().
Appreciate any help on this matter.
Thank you.

Retry logic in Kafka consumer

I have a use case where I consume certain logs from a queue and hit some third-party APIs with some info from that log. In case the third-party system is not responding properly, I wish to implement retry logic for that particular log.
I can add a time field and re-push the message to the same queue; the message will be consumed again if its time field is valid, i.e. less than the current time, and if not, it gets pushed back into the queue.
But this logic will add the same log again and again until the retry time is reached, and the queue will grow unnecessarily.
Is there a better way to implement retry logic in Kafka?
You can create several retry topics and push failed tasks there. For instance, you can create 3 topics with different delays in minutes and rotate a single failed task until the max attempt limit is reached.
‘retry_5m_topic’ — for retry in 5 minutes
‘retry_30m_topic’ — for retry in 30 minutes
‘retry_1h_topic’ — for retry in 1 hour
See more for details: https://blog.pragmatists.com/retrying-consumer-architecture-in-the-apache-kafka-939ac4cb851a
In the consumer, if processing throws an exception, produce another message with attempt number 1, so the next time it is consumed it carries attempt number 1. Handle it on the producer side so that if the attempt count exceeds your retry limit, you stop producing the message.
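Combining the two suggestions above, a rough sketch in Java might look like this (the topic names, the attempt header, and the third-party call are all made up for illustration): the consumer reads the main topic, and on failure republishes the record to a delayed retry topic with an incremented attempt counter, giving up to a dead letter topic once the limit is reached.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class RetryingConsumer {
    private static final int MAX_ATTEMPTS = 3;

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        consumerProps.put("group.id", "log-processor");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("logs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    int attempt = readAttempt(record);
                    try {
                        callThirdPartyApi(record.value()); // hypothetical call that may fail
                    } catch (Exception e) {
                        // route the failed record to a delayed retry topic, or give up after MAX_ATTEMPTS
                        String target = attempt + 1 >= MAX_ATTEMPTS ? "logs-dead-letter" : "logs-retry-5m";
                        ProducerRecord<String, String> retry =
                                new ProducerRecord<>(target, record.key(), record.value());
                        retry.headers().add("attempt",
                                Integer.toString(attempt + 1).getBytes(StandardCharsets.UTF_8));
                        producer.send(retry);
                    }
                }
            }
        }
    }

    private static int readAttempt(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("attempt");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    private static void callThirdPartyApi(String payload) {
        // placeholder for the third-party API call from the question
    }
}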
Yes, this could be one straightforward solution that I also thought of. But with this we will end up creating many topics, as it is possible that message processing will fail again.
I solved this problem by mapping this use case to RabbitMQ. In RabbitMQ we have the concept of a retry exchange: if message processing fails in an exchange, you can send the message to a retry exchange with a TTL. Once the TTL expires, the message moves back to the main exchange and is ready to be processed again.
I can post some examples explaining how to implement exponential backoff message processing using RabbitMQ.