How to identify unprocessed messages after KAFKA topic consumption - offset

Scenario :
Stream create [StreamName] --definition " Kafka -zkconnect=10.10.10.1:2181 --topic=<topic name> | MyCompositeModule " --deploy
We are running this stream in distributed mode and redis is the transport bus.
Per my understanding, kafka source maintains the offsets for messages consumed by MyCompositeModule (which is a sink as its a module created through 'module compose' process) through [streamname]-kafka-offsets topic. Which is unreadable and I would appreciate if there was a way to read the data from this topic.
Also, when I push the messages from kafka source the messages are queued in redis transport, then the module fetches them from this queue.
If kafka consumer module starts consuming 1000 messages from kafka redis queue- and composite module fails after receiving 10 messages or randomly processed 10 messages.So how to identify remaining 990 [ 1000 ( consumed ) - 10 (processed) = 990 ] unprocessed messages .
Even if we check kafka offsets it will show consumed messages count. example: -kafka.offsets - which is unreadable in our process.
So all the unprocessed messages will be in Redis queue as we are using Redis in SpringXD. So can anyone help me out how to identify the unprocessed messages and
re sending it to composite module to process.
Basically, I am looking for the recommendations on an elegant solution for robust delivery, adding failure handling capability in spring xd stream when consuming from from kafka source.

If the messages are effectively consumed from Kafka and moved to the bus, then they will be acknowledged as consumed from the offset manager's perspective.
You can try enabling retry and dead lettering for the Redis Message Bus as described here: http://docs.spring.io/spring-xd/docs/current/reference/html/#error-handling-message-delivery-failures.
Cheers,
Marius

Related

Kafka Consumer as a service which continuously polls messages from a Kafka Topic

I need to create an independent server which continuously polls data from a Kafka Topic. Till now I have created a vertx server and a Kafka Consumer. The issue is when we start the server it drains the queue and polls all the messages currently present but when new messages will arrive later it wont be able to consume them. I need to implement some logic in my server which able it to continuously poll the data and drain the queue.

Apache Kafka: how to configure message buffering properly

I run a system comprising an InfluxDB, a Kafka Broker and data sources (sensors) producing time series data. The purpose of the broker is to protect the database from inbound event overload and as a format-agnostic platform for ingesting data. The data is transferred from Kafka to InfluxDB via Apache Camel routes.
I would like to use Kafka a intermediate message buffer in case a Camel route crashes or becomes unavailable - which is the most often error in the system. Up to now, I didn’t achieve to configure Kafka in a manner that inbound messages remain available for later consumption.
How do I configure it properly?
The messages will retain in Kafka topics based on its retention policies (you can choose between time or byte size limits) as described in the Topic Configurations. With
cleanup.policy=delete
Retention.ms=-1
the messages will in a Kafka topic will never be deleted.
Then your camel consumer will be able to re-read all messages (offsets) if you select a new consumer group or reset the offsets of the existing consumer group. Otherwise, your camel consumer might auto commit the messages (check corresponding consumer configuration) and it will not be possible to re-read offsets again for the same consumer group.
To limit the consumption rate of the camel consumer you may adjust configurations like maxPollRecords or fetchMaxBytes which are described in the docs.

Google Pubsub vs Kafka comparison on the restart of pipeline

I am trying to write an ingestion application on GCP by using Apache Beam.I should write it in a streaming way to read data from Kafka or pubsub topics and then ingest to datasource.
while it seems straight forward to write it with pubsub and apache beam but my question is what would happen if my ingestion fails or to be restarted and if it again reads all data from the start of pubsub topic or like kafka it can read from latest committed offsets in the topic?
Pub/sub messages are persisted until they are delivered and acknowledge by the subscribers which receives pending messages from its subscription. Once the message is acknowledge, it's removed from the subscription's queue.
For more information regarding the message flow, check this document
Hope it helps.

Kafka Streams application stops working after no message have been read for a while

I have noticed that my Kafka Streams application stops working when it has not read new messages from the Kafka topic for a while. It is the third time that I have seen this happen.
No messages have been produced to the topic since 5 days. My Kafka Streams application, which also hosts a spark-java webserver, is still responsive. However, the messages I produce to the Kafka topic are not being read by Kafka Streams anymore. When I restart the application, all messages will be fetched from the broker.
How can I make my Kafka Streams Application more durable to this kind of scenario? It feels that Kafka Streams has an internal "timeout" after which it closes the connection to the Kafka broker when no messages have been received. I could not find such a setting in the documentation.
I use Kafka 1.1.0 and Kafka Streams 1.0.0
Kafka Streams do not have an internal timeout to control when to permanently close a connection to the Kafka broker; Kafka broker, on the other hand, does have some timeout value to close idle connections from clients. But Streams will keep trying to re-connect once it has some processed result data that is ready to be sent to the brokers. So I'd suspect your observed issue came from some other causes.
Could you share your application topology sketch and the config properties you used, for me to better understand your issue?

Put() vs Flush() in Kafka Connector Sink Task

I am trying to send the data in a batch to a NOSQL database using Kafka Sink Connector. I am following https://kafka.apache.org/documentation/#connect documentation and confused about where the logic of sending records has to be implemented. Please help me in understanding how the records are processed internally and what has to be used Put() or Flush() to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's ElasticSearch sink connector, on the other hand, simply accumulates all of the messages for a series of put(...) methods and only writes them to Elasticsearch during flush(...).
The frequency that the offsets are committed for source and sink connectors is controlled by the connector's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but is frequent enough to cap the potential amount of re-processing should the connector task unexpectedly die. Note that when the connector is shutdown gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying what messages had been processed. Thus, only after restarting after such a failure will the connector potentially re-process some messages that it did just prior to the failure. And it's because messages will potentially be seen at least once that the messages should be idempotent. Take all of this plus your connectors' behavior into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.