I have a simple kafka configuration:
default console-producer writes to one topic with one partition
custom processor (which implements org.apache.kafka.streams.processor.Processor) reads from Kafka and processes messages.
During processing I validate a batch of messages; if validation fails I log the key of the first message of the failed batch.
Is it possible in Kafka to query for a message with a given key/offset? Or to get the 100 messages after that message?
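For reference, a minimal sketch of what is possible here: Kafka has no key-based lookup, but a plain consumer can be pointed at a known offset and read the records that follow it, so logging the offset alongside the key makes replay straightforward. Topic name, partition and offset below are placeholders.

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No group.id and no subscribe(): the partition is assigned manually,
        // so this ad-hoc consumer never joins a group or commits offsets.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        TopicPartition tp = new TopicPartition("my-topic", 0); // single-partition topic assumed
        long failedOffset = 42L;                               // offset logged for the failed batch

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, failedOffset);                   // jump to the logged offset

            List<ConsumerRecord<String, String>> replay = new ArrayList<>();
            int emptyPolls = 0;
            while (replay.size() < 100 && emptyPolls < 5) {    // collect the next 100 records
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    emptyPolls++;
                }
                for (ConsumerRecord<String, String> record : records) {
                    if (replay.size() < 100) {
                        replay.add(record);
                    }
                }
            }
            replay.forEach(r -> System.out.printf("offset=%d key=%s%n", r.offset(), r.key()));
        }
    }
}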
Related
I have a concern regarding the Debezium outbox pattern: when Kafka Connect consumes messages from the outbox table and tries to produce them to a Kafka topic, if the Kafka brokers are down, does Debezium retry the delivery or just fail?
Do we have any configuration for the Kafka produce-failure scenario?
Correct me if my understanding is wrong.
Spring Kafka batch support matters when we consume records from topics and write them to a file system or a database: the write happens in batch mode on the file-system or database end, not on the Kafka end. That is why Spring Kafka batch support came into the picture.
The same applies when reading data in batches from a file and writing it to Kafka topics; there, too, the batching happens on the file end.
So if we are neither consuming data from a file nor writing to a file between the consumer and the producer, is it the same as using a normal Kafka message listener and message writer?
I read that the polling configuration decides how much data is consumed per batch in Kafka. Even if we enable batch listener mode, that is about acknowledging multiple messages at once.
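For reference, a minimal sketch (assuming Spring Kafka, with made-up bean and topic names and an existing ConsumerFactory bean) of what batch-listener mode changes: the container hands the listener everything returned by one poll() as a single List, which is what makes batched writes and batched acknowledgement possible on the consumer side.

import java.util.List;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.stereotype.Component;

@Configuration
class BatchListenerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> batchFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true); // deliver each poll() result as one batch
        return factory;
    }
}

@Component
class BatchWritingListener {

    // The whole polled batch arrives in one call; writing it to a file or database
    // and acknowledging it can then happen per batch rather than per record.
    @KafkaListener(topics = "my-topic", containerFactory = "batchFactory")
    public void onBatch(List<String> messages) {
        messages.forEach(System.out::println); // placeholder for the batched write
    }
}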
I'm new to Kafka and have been studying its behaviour when messages are sent to it while it is stopped.
The scenario I face: I stop Kafka using 'kubectl delete statefulset kafka_kf'. Then I send a number of messages using the Java Kafka producer. When I start Kafka again, the messages that were sent while it was down immediately appear in the consumer. Any idea what happens inside Kafka in this case, and how to prevent these messages from appearing in the consumer? They cause a duplication issue later, which is why I need them not to appear.
I see the messages in a consumer opened with the command:
kubectl exec -ti test -- ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --isolation-level read_committed --topic testtopic
The piece of code that is used to send messages to Kafka is:
producer.send(message)
First, I think it is important to understand that producer.send() is an asynchronous call, so it does not block. Second, the send() method does not actually push the message to the brokers; instead it places the message in a buffer in local memory. There is a separate buffer (a queue of batches) for each partition of the topics that the producer writes to. The records are actually pushed to the brokers by an internal background sender thread on the producer side, triggered by configurable batching thresholds. It is this sender thread that waits for the acks from the brokers (as configured by the acks setting), not the send() method.
[Source: Confluent Training - Developer Skills for Building Apache Kafka]
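A small sketch of that behaviour (broker address, topic and property values are only illustrative):

import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, "50");      // let the sender wait up to 50 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768");  // per-partition batch size in bytes
        props.put(ProducerConfig.ACKS_CONFIG, "all");          // acks are awaited by the sender thread, not by send()

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() only appends the record to the local buffer and returns a Future
            // immediately; the record may not have reached any broker yet at this point.
            Future<RecordMetadata> future =
                    producer.send(new ProducerRecord<>("testtopic", "key", "value"));
        } // close() flushes whatever is still buffered
    }
}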
When Kafka is not available you will get a TimeoutException in your producer. However, this exception is handled by retries, and the producer configuration retries defaults to 2147483647 (Integer.MAX_VALUE).
As soon as you make Kafka available again, your producer is able to actually send the buffered messages to Kafka, and your consumer will receive them.
If you do not want to receive those messages you need to set the KafkaProducer configuration retries=0.
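Continuing the props object from the sketch above, the relevant producer settings look roughly like this (the values are illustrative, not recommendations). Note that delivery.timeout.ms bounds how long a buffered record may wait before it is reported as failed, which also matters for records sent while the brokers are down.

// Defaults: retries = Integer.MAX_VALUE, bounded by delivery.timeout.ms (120000 ms).
props.put(ProducerConfig.RETRIES_CONFIG, "0");                  // do not retry failed sends
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "30000");  // give up on a buffered record after 30 s
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000");         // how long send() may block waiting for metadata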
To understand more about producer callback exceptions, you could look into another answer of mine.
Edit for new question in comment:
Is there any way to find whether a message (or all the messages) was successfully sent or not?
You can define a custom Callback class like the one below and pass it when sending the data. Its onCompletion method is invoked with a non-null Exception if something went wrong while producing the message.
import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

class ProducerCallback extends Callback {
  // Invoked once the send has completed; e is non-null if the send failed.
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (e != null) {
      e.printStackTrace()
    }
  }
}

producer.send(message, new ProducerCallback)
As an alternative you could simply call
producer.send(message).get()
as this will block until all acknowledgments have been received from the Kafka brokers (see the KafkaProducer configuration acks).
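If you go that route, a failure surfaces as an ExecutionException wrapping the real cause. A Java sketch of checking the result might look like this (producer and message are assumed to exist as above; imports needed: java.util.concurrent.ExecutionException and org.apache.kafka.clients.producer.RecordMetadata):

try {
    RecordMetadata metadata = producer.send(message).get();   // blocks until acked or failed
    System.out.printf("written to %s-%d@%d%n",
            metadata.topic(), metadata.partition(), metadata.offset());
} catch (ExecutionException e) {
    // e.getCause() is the real producer error, e.g. a TimeoutException
    e.getCause().printStackTrace();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}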
I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records has to be implemented. Please help me understand how the records are processed internally and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's Elasticsearch sink connector, on the other hand, simply accumulates all of the messages from a series of put(...) calls and only writes them to Elasticsearch during flush(...).
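A rough sketch of the two styles just described, not a complete connector (the store client and the bulk-write helper are made up):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MyNoSqlSinkTask extends SinkTask {

    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public void start(Map<String, String> props) {
        // open the connection to the NoSQL store here
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Option A: write each batch straight to the store right here.
        // Option B (shown): only accumulate, and defer the write to flush().
        buffer.addAll(records);
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Called just before Kafka Connect commits offsets: persist everything
        // accumulated so far, so committed offsets never run ahead of the store.
        writeBatchToStore(buffer); // hypothetical bulk-write helper
        buffer.clear();
    }

    @Override
    public void stop() {
        // close the store connection here
    }

    @Override
    public String version() {
        return "0.0.1";
    }

    private void writeBatchToStore(List<SinkRecord> records) {
        // placeholder for the actual NoSQL bulk write
    }
}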
The frequency with which the offsets are committed for source and sink connectors is controlled by the worker's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should the connector task die unexpectedly. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will still have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not get a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure might the connector re-process some messages that it had already handled just prior to the failure. And because messages may be seen at least once, the message processing should be idempotent. Take all of this plus your connector's behavior into account when determining an appropriate value for this setting.
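For reference, a tiny excerpt of what this looks like in the Connect worker configuration (connect-distributed.properties or connect-standalone.properties); 60000 ms is the default interval:

# how often the worker asks tasks to flush and then commits offsets
offset.flush.interval.ms=60000
# how long a single offset flush may take before it is cancelled and retried later
offset.flush.timeout.ms=5000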
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.
Scenario:
stream create [StreamName] --definition "kafka --zkconnect=10.10.10.1:2181 --topic=<topic name> | MyCompositeModule" --deploy
We are running this stream in distributed mode and Redis is the transport bus.
Per my understanding, the Kafka source maintains the offsets for messages consumed by MyCompositeModule (which acts as a sink, since it is a module created through the 'module compose' process) in the [streamname]-kafka-offsets topic. This topic is unreadable, and I would appreciate a way to read its data.
Also, when messages are pushed from the Kafka source they are queued in the Redis transport, and the module then fetches them from that queue.
Suppose the Kafka consumer module consumes 1000 messages from Kafka into the Redis queue, and the composite module fails after processing only 10 of them. How do I identify the remaining 990 (1000 consumed - 10 processed = 990) unprocessed messages?
Even if we check the Kafka offsets, they only show the count of consumed messages (for example the -kafka.offsets topic, which is unreadable in our process).
So all the unprocessed messages will be in the Redis queue, since we are using Redis in Spring XD. Can anyone help me identify the unprocessed messages and resend them to the composite module for processing?
Basically, I am looking for recommendations on an elegant solution for robust delivery, adding failure-handling capability to a Spring XD stream when consuming from a Kafka source.
If the messages are effectively consumed from Kafka and moved to the bus, then they will be acknowledged as consumed from the offset manager's perspective.
You can try enabling retry and dead lettering for the Redis Message Bus as described here: http://docs.spring.io/spring-xd/docs/current/reference/html/#error-handling-message-delivery-failures.
Cheers,
Marius