Is there any way to get the time when a consumer group last committed offsets? In other words, can we find out the timestamp of the last commit? If so, does the Kafka Java library allow it to be obtained?
I have tried to find this out but haven't found anything satisfactory.
The Offset Explorer client shows the timestamp of the last commit by a consumer group. Does anyone have a clue how it manages to do so, and how similar clients fetch details that don't seem to be available through the programming APIs?
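For what it's worth, there doesn't appear to be a public consumer/admin API that returns the commit timestamp, so one plausible explanation (and presumably roughly what such GUI clients do) is to read the internal __consumer_offsets topic, where every commit is stored as a record. Below is a minimal sketch under that assumption; the bootstrap address and group id are placeholders, and fully decoding the record key/value (group, topic, partition, offset) would require Kafka's internal message schemas, which are not part of the public Java API.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class CommitTimestampSniffer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-timestamp-inspector"); // placeholder group for this tool
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Consumers skip internal topics by default; allow reading __consumer_offsets.
        props.put(ConsumerConfig.EXCLUDE_INTERNAL_TOPICS_CONFIG, "false");
        // Read the retained (compacted) history as well, not just new commits.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("__consumer_offsets"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // The record timestamp is set when the commit is written, so it
                    // approximates the commit time. Decoding the key/value to see which
                    // group/topic/partition it belongs to requires Kafka's internal formats.
                    System.out.println("commit record at " + record.timestamp());
                }
            }
        }
    }
}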
When tailing the oplog, I see a timestamp for each event. Change streams have advantages over tailing the oplog directly, so I'd like to use those. However, I can't find any way of figuring out when a change occurred. This would be problematic if my script went down for a while and then resumed using a resume token.
Is there any way of getting that timestamp?
I can't find any way of figuring out when a change occurred.
Currently (MongoDB v3.6), there is no way to find out the timestamp of an event returned by the server from the receiving end. This is because the cluster time's timestamp is embedded into the resume token in a binary format.
There is a ticket requesting a tool to inspect this resume token: SERVER-32283. Feel free to watch/upvote the ticket for updates.
This would be problematic if my script went down for a while and then resumed using a resume token.
When you resume a change stream using the resume token, it resumes from that point forward. This is because the token contains the cluster time, so the server knows the last operation the token has 'seen'.
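For reference, a minimal sketch of resuming with the MongoDB Java driver (3.6); the host, database, and collection names are placeholders, and loadSavedToken/saveToken are hypothetical helpers standing in for however your application persists the token.

import com.mongodb.MongoClient;
import com.mongodb.client.ChangeStreamIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.BsonDocument;
import org.bson.Document;

public class ResumeChangeStream {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost"); // placeholder host
        MongoCollection<Document> coll =
                client.getDatabase("mydb").getCollection("mycoll"); // placeholder names

        BsonDocument savedToken = loadSavedToken(); // null on first run

        ChangeStreamIterable<Document> stream = coll.watch();
        if (savedToken != null) {
            // Pick up where the previous run stopped.
            stream = stream.resumeAfter(savedToken);
        }

        try (MongoCursor<ChangeStreamDocument<Document>> cursor = stream.iterator()) {
            while (cursor.hasNext()) {
                ChangeStreamDocument<Document> event = cursor.next();
                // Persist the token after handling each event so a restart can resume here.
                saveToken(event.getResumeToken());
                System.out.println(event.getFullDocument());
            }
        }
    }

    // Hypothetical persistence helpers; replace with your own storage.
    private static BsonDocument loadSavedToken() { return null; }
    private static void saveToken(BsonDocument token) { }
}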
You also said "down for a while". Change streams are built on the replica set oplog, which means the resumability of change streams is limited by the size of the oplog window.
For example, if the last cached token time was 24 hours ago and the oplog window is only 12 hours, your application will not be able to utilise the change streams resume token. Since you're comparing change streams with tailing the oplog, both would have the same potential issue in this regard.
If this is a real concern for your use case, i.e. the receiving client could have downtime greater than the oplog window, adjust your oplog window size accordingly.
See also Change Streams Production Recommendations
I've built a Kafka Streams application. It's my first one, so I'm moving out of a proof-of-concept mindset into a "how can I productionalize this?" mindset.
The tl;dr version: I'm looking for Kafka Streams deployment recommendations and tips, specifically related to updating the application code.
I've been able to find lots of documentation about how Kafka and the Streams API work, but I couldn't find anything on actually deploying a Streams app.
The initial deployment seems to be fairly easy - there is good documentation for configuring your Kafka cluster, then you must create topics for your application, and then you're pretty much fine to start it up and publish data for it to process.
But what if you want to upgrade your application later? Specifically, if the update contains a change to the topology. My application does a decent amount of data enrichment and aggregation into windows, so it's likely that the processing will need to be tweaked in the future.
My understanding is that changing the order of processing or inserting additional steps into the topology will cause the internal ids for each processing step to shift, meaning at best new state stores will be created with the previous state being lost, and at worst processing steps will read from an incorrect state store topic when starting up (a sketch after the list below shows how those internal names are generated). This implies that you either have to reset the application or give the new version a new application id. But there are some problems with that:
If you reset the application or give a new id, processing will start from the beginning of source and intermediate topics. I really don't want to publish the output to the output topics twice.
Currently "in-flight" data would be lost when you stop your application for an upgrade (since that application would never start again to resume processing).
The only way I can think to mitigate this is to:
Stop data from being published to source topics. Let the application process all messages, then shut it off.
Truncate all source and intermediate topics (one programmatic option is sketched after these steps).
Start new version of application with a new app id.
Start publishers.
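The truncation step can be scripted. A minimal sketch of one option, assuming brokers and clients at 1.1.0 or newer (which added AdminClient.deleteRecords) and placeholder broker/topic names; bin/kafka-delete-records.sh achieves the same from the shell.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class TruncateTopic {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092"; // placeholder broker
        String topic = "source-topic";       // placeholder topic to truncate

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             AdminClient admin = AdminClient.create(adminProps)) {

            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            // Deleting everything before the current end offset effectively empties the topic
            // without removing it, so topic configuration stays in place.
            Map<TopicPartition, RecordsToDelete> toDelete = new HashMap<>();
            consumer.endOffsets(partitions)
                    .forEach((tp, end) -> toDelete.put(tp, RecordsToDelete.beforeOffset(end)));

            admin.deleteRecords(toDelete).all().get();
        }
    }
}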
This is "okay" for now since my application is the only one reading from the source topics, and intermediate topics are not currently used beyond feeding to the next processor in the same application. But, I can see this getting pretty messy.
Is there a better way to handle application updates? Or are my steps generally along the lines of what most developers do?
I think you have a full picture of the problem here and your solution seems to be what most people do in this case.
During the latest Kafka Summit, this question was asked after the talk by Gwen Shapira and Matthias J. Sax about Kubernetes deployment. The response was the same: if your upgrade contains topology modifications, rolling upgrades can't be done.
It looks like there is no KIP about this for now.
Our requirement is that if a Kafka Streams app is consuming a partition, it should start its consumption from the latest offset of that partition.
This seems doable using
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
Now, let's say that using the above configuration, the Kafka Streams app started consuming data from the latest offset of a partition. After some time, the app crashes. When the app comes back up, we want it to consume data from the latest offset of that partition again, instead of from where it left off.
But I can't find anything in the Kafka Streams API that helps achieve this.
P.S. We are using kafka-1.0.0.
That is not supported out of the box.
The auto.offset.reset configuration only triggers if there are no committed offsets, and there is no config to change this behavior.
You could manipulate offsets manually before startup using bin/kafka-consumer-groups.sh though: the application.id is the group.id, and you could "seek to end" before you restart the application.
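If you would rather do the "seek to end" from code than via the shell tool, a plain consumer that shares the application's group.id can commit the end offsets before the Streams instance is started (as with the shell tool, the application must be stopped at that point). A minimal sketch; the application id, topic, and bootstrap address are placeholders.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class SeekToEndBeforeStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");          // must equal application.id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        String topic = "input-topic"; // placeholder source topic

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            // Look up the current end offset of every partition and commit it for the
            // Streams application's group, so the app starts from "latest" when it comes up.
            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            consumer.endOffsets(partitions)
                    .forEach((tp, offset) -> commits.put(tp, new OffsetAndMetadata(offset)));
            consumer.commitSync(commits);
        }
    }
}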
Update:
Since the 1.1.0 release, you can use the bin/kafka-streams-application-reset.sh tool to set starting offsets. To use the tool, the application must be offline. (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-171+-+Extend+Consumer+Group+Reset+Offset+for+Stream+Application)
I am investigating data reprocessing with Kafka Streams. There is a nice tool available for reprocessing data by resetting the streams application: the Application Reset Tool.
But this tool resets the application state to zero and reprocesses everything from scratch.
There are scenarios when we want to reprocess the data from a specific point, i.e.:
Bug fix in the current application
Updating the application with an additional processor and running it with the same application ID
In Flink, there is the concept of savepoints, which can restore previous operator states and allow adding new operators without any error.
I have also referred to the following documents:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Data+%28Re%29Processing+Scenarios
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
I would like to know:
Is there any checkpointing-like mechanism available in Kafka Streams?
How can we re-run the Kafka Streams application from a specific point?
What happens if we change the code in one of the application instances and run it with the old application ID?
Kafka Streams does not have a savepoint concept at this point (version 1.0).
Not at the moment (v1.0)
Yes. In the next release this will be part of the reset tool directly. In 1.0, you can use bin/kafka-consumer-groups.sh to commit a start offset for your application (note: application.id == group.id). For older Kafka versions, you could build a custom tool to commit start offsets; a sketch of such a tool is shown below.
In general, it breaks. Thus, you need to use a new application.id (it's a known issue and will be fixed in future releases).
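As an illustration of such a custom tool, the sketch below commits start offsets derived from a timestamp for the application's group (application.id == group.id, and the application must be stopped while the offsets are committed). The broker address, topic, group, and timestamp are placeholders.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class CommitStartOffsetsFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-streams-app");          // == application.id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        String topic = "input-topic";        // placeholder source topic
        long reprocessFrom = 1514764800000L; // placeholder epoch millis to restart from

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            // Translate the timestamp into per-partition offsets.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, reprocessFrom));

            Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) { // null if there is no message at/after the timestamp
                    commits.put(e.getKey(), new OffsetAndMetadata(e.getValue().offset()));
                }
            }
            // The Streams app will pick these up as its committed offsets on the next start.
            consumer.commitSync(commits);
        }
    }
}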
We have to use the message topic in WIoTP rules; that is, if a certain event is triggered X times in Y minutes, then trigger an action. I am not able to see an option for selecting the message topic while creating a rule. Can someone suggest how this can be done?
It's not currently possible to select a message topic in the Rules configuration. The documentation for triggering rules when the conditions are met N times in the selected time interval can be found here:
https://console.bluemix.net/docs/services/IoT/cloud_analytics.html#conditional