Is it possible to filter Apache Kafka messages by retention time? - apache-kafka

From an abstract point of view, Apache Kafka stores data in topics, and this data can be read by consumers.
I'd like to have a (monitoring) consumer which picks up records of a certain age. The monitor should send a warning to subsystems that records are still unread and would be discarded by Kafka once they reach the retention time.
I couldn't find a suitable way to do this so far.

You can use KafkaConsumer.offsetsForTimes() to map timestamps to offsets.
For example, if you call it with yesterday's date and it returns offset X, then any message with an offset smaller than X is older than yesterday.
Then your logic can figure out, from the current positions of your consumers, whether you are at risk of having unprocessed records discarded.
Note that there is currently a KIP under discussion to expose metrics to track that: https://cwiki.apache.org/confluence/display/KAFKA/KIP-223+-+Add+per-topic+min+lead+and+per-partition+lead+metrics+to+KafkaConsumer
http://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-
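
For illustration, here is a minimal sketch of such a monitor using the Java consumer. The topic name, group id, single partition, 7-day retention and 6-hour warning window are all assumptions for the example:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class RetentionMonitor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "monitored-group");  // group to monitor (assumption)
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            long retentionMs = 7L * 24 * 60 * 60 * 1000;  // the topic's retention.ms (assumption)
            long warningMs = 6L * 60 * 60 * 1000;         // warn 6 hours before records expire

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);

                // Earliest offset that will still be within retention 'warningMs' from now.
                Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(
                        Collections.singletonMap(tp,
                                System.currentTimeMillis() - retentionMs + warningMs));
                OffsetAndTimestamp cutoff = result.get(tp);

                // Last committed offset of the monitored group on this partition.
                OffsetAndMetadata committed = consumer.committed(tp);

                if (cutoff != null && committed != null && committed.offset() < cutoff.offset()) {
                    // Everything below cutoff.offset() is still unread and expires within warningMs.
                    System.out.println("WARNING: unread records on " + tp + " are about to be discarded");
                }
            }
        }
    }

Keep in mind that retention is enforced per log segment, so treat the comparison as approximate rather than an exact expiry time.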

Related

Instruct Kafka Consumer App To Start Reading From Offset

If I have an application AppA that contains a Kafka consumer class, is it possible to instruct this consumer's behaviour programmatically? For example, I may want to tell AppA over a REST API (or even via another topic) to wake up and begin consuming and processing messages from TopicB at offset or timestamp X and to stop at offset or timestamp Y. I may tell it to read the same sections of a topic repeatedly to perform different analyses of the data, and I might want the consumer to sit idle when it's not performing an instruction.
Is it possible to control a consumer in this fashion? Essentially, I'm interested to know if I can read sections of topics on demand to produce processing/reports on their contents, kind of in a similar way to querying a relational DB via an admin console, I guess.
Thanks in advance!
The Kafka consumer is able to consume topics at arbitrary positions.
You can use the seek() method to start consuming from a specific offset. You can also use the offsetsForTimes() method to find the offsets for a specific timestamp.
You can combine these two methods to consume specific sections of topics on demand.
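
Here is a minimal sketch of that combination, assuming a single partition, an already-configured KafkaConsumer (bootstrap servers, deserializers, etc.), and a placeholder println instead of real processing:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class SectionReader {

        /** Reads one partition of a topic between two timestamps (milliseconds). */
        static void readSection(KafkaConsumer<String, String> consumer,
                                TopicPartition tp, long startMs, long endMs) {
            consumer.assign(Collections.singletonList(tp));

            // Translate the requested start/end timestamps into offsets.
            Map<TopicPartition, OffsetAndTimestamp> start =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, startMs));
            Map<TopicPartition, OffsetAndTimestamp> end =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, endMs));

            if (start.get(tp) == null) {
                return;  // no records at or after startMs
            }
            long startOffset = start.get(tp).offset();
            // If no record is newer than endMs, fall back to the current end of the partition.
            long endOffset = end.get(tp) != null
                    ? end.get(tp).offset()
                    : consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            consumer.seek(tp, startOffset);
            while (consumer.position(tp) < endOffset) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.offset() >= endOffset) {
                        break;  // past the requested section
                    }
                    System.out.println(record.offset() + ": " + record.value());  // placeholder processing
                }
            }
        }
    }

A REST endpoint or a control topic could then simply invoke such a method with the requested start/end timestamps and leave the consumer idle otherwise.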

Kafka Streams: reprocessing old data when windowing

I have a Kafka Streams application that performs windowing (using original event time, not wallclock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data?
For example: if today is 2022-01-09 and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE Using Kafka Streams 2.5.0
Updated answer for Kafka Streams version 2.5 (per the OP's update):
When using event time, Kafka Streams behaves independently of the wallclock time, as long as you have not configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which will consume the partitions one event at a time. For any given topic, at most one of its partitions will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked for their timestamp in comparison to the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped; otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset; it is independent of the execution time of the processing. If events are dropped, the corresponding metrics are changed.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably newer data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the topic with the newer data. Events of the older topic might be dropped if the time window configuration does not have a large enough retention or grace period. See the JavaDoc of TimeWindows for the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data arrives for a time window after stream time has advanced beyond that window's end plus the grace period, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly, so this behaviour should be easy to spot.
I suggest trying out this reprocessing, if feasible, and watching the logs and metrics.
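
To make the window configuration concrete, here is a minimal sketch of a 1-day windowed aggregation with an explicit grace period in the Kafka Streams 2.5 DSL; the topic name, serdes, store name and the 7-day grace value are assumptions for the example, not recommendations:

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;
    import org.apache.kafka.streams.state.WindowStore;

    public class WindowedTopology {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Input topic name and serdes are assumptions for the example.
            KStream<String, Double> input = builder.stream("events",
                    Consumed.with(Serdes.String(), Serdes.Double()));

            Duration windowSize = Duration.ofDays(1);
            Duration grace = Duration.ofDays(7);  // how late a record may arrive and still be accepted

            KTable<Windowed<String>, Long> counts = input
                    .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                    .windowedBy(TimeWindows.of(windowSize).grace(grace))
                    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("daily-counts")
                            // the window store must retain at least window size + grace
                            .withRetention(windowSize.plus(grace)));

            System.out.println(builder.build().describe());
        }
    }

The larger the grace period (and store retention), the more out-of-order or historical data is accepted during reprocessing, at the cost of keeping window state around longer.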

Starting new Kafka Streams microservice, when there is data retention period on input topics

Let's assume I have a (somewhat) high-velocity input topic, for example sensor.temperature, and it has a retention period of 1 day.
Multiple microservices are already consuming data from it. I am also backing up events in historical event store.
Now (as a simplified example) I have a new requirement: calculating the maximum all-time temperature per sensor.
This fits very well with Kafka Streams, so I have prepared a new microservice that creates a KTable aggregating temperature (with max) grouped per sensor.
Simply deploying this microservice would be enough if the input topic had infinite retention, but as it stands the maximum would not be all-time, which is what the requirement demands.
I feel this could be a common scenario, but somehow I was not able to find a satisfying solution on the internet.
Maybe I am missing something, but my ideas for how to make it work do not feel great:
Replay all past events into the input topic sensor.temperature. This is a large amount of data, and it would cause all subscribing microservices to run excessive computation, which is most likely not acceptable.
Create a duplicate of the input topic for my microservice: sensor.temperature.local, where I would always copy all events and then further process (aggregate) them from this local topic.
This way I can freely replay historical events into the local topic without affecting other microservices.
However, this local duplicate would be required for every Kafka Streams microservice, and if the input topic is high velocity, this could be too much duplication.
Maybe there is some way to modify KTables more directly, so one could query the historical event store for the max value per sensor and put it in the KTable once?
But what if the streams topology is more complex? It would require orchestrating consistent state in all the microservices' KTables, rather than simply replaying events.
How to design the solution?
Thanks in advance for your help!
In this case I would create a topic that stores the max periodically (so that it won't fall off the topic because of a cleanup). Then you could make your service report the maximum of the max-topic and the maximum of the measurement-topic.
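
One way this could look in the Streams DSL, sketched under some assumptions: the seed topic (bootstrapped once from the historical event store), the topic names, serdes and application id are all made up for the example, and sensor.temperature.max is assumed to be a compacted topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;

    public class AllTimeMax {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Live measurements keyed by sensor id; value = temperature (serdes are assumptions).
            KStream<String, Double> live = builder.stream("sensor.temperature",
                    Consumed.with(Serdes.String(), Serdes.Double()));

            // Hypothetical seed topic, filled once by replaying maxima from the historical store.
            KStream<String, Double> seed = builder.stream("sensor.temperature.max.seed",
                    Consumed.with(Serdes.String(), Serdes.Double()));

            // Merge both sources and keep the running maximum per sensor.
            KTable<String, Double> allTimeMax = live.merge(seed)
                    .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                    .reduce(Math::max, Materialized.as("all-time-max-store"));

            // Publish the maxima to a compacted topic so they outlive the 1-day input retention.
            allTimeMax.toStream().to("sensor.temperature.max",
                    Produced.with(Serdes.String(), Serdes.Double()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "all-time-max");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            new KafkaStreams(builder.build(), props).start();
        }
    }

Because max is idempotent, replaying the seed topic more than once yields the same result.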

Use Kafka offsets to calculate written messages statistics

I want to get some statistics from a Kafka topic:
total written messages
total written messages in the last 12 hours, last hour, ...
Can I safely assume that reading the offsets for each partition in a topic for a given timestamp (using getOffsetsByTimes) should give me the number of messages written up to that specific time?
I can sum all the offsets for every partition and then calculate the difference between a timestamp 1 and a timestamp 2. With these data I should be able to calculate a lot of statistics.
Are there situations where these data can give me wrong results? I don't need 100% precision, but I expect to have a reliable solution. Of course I'm assuming that the topic is not deleted/reset.
Are there other alternatives that don't use third-party tools? (I cannot install other tools easily, and I need the data inside my app.)
(using getOffsetsByTimes) should give me the number of messages written in that specific time?
In Kafka: The Definitive Guide, it is mentioned that getOffsetsByTime is not message-based but segment-file based: the time-index lookup won't read into a segment file; rather, it returns the first segment containing the time you are interested in. (This may have changed in newer Kafka releases since the book was released.)
If you don't need accuracy, this should be fine. Do note that compacted topics don't have sequentially ordered offsets one after the other, so a simple abs(offset#time2 - offset#time1) won't quite work for "total existing messages in a topic".
Otherwise, plenty of JMX metrics are exposed by the brokers like bytes-in and message rates, which you can aggregate and plot over time using Grafana, for example.
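
To illustrate the offset-difference approach (with the caveats above), here is a minimal sketch; it assumes an already-configured KafkaConsumer, and compaction, deleted records and transaction markers can all make the count approximate:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    public class TopicThroughput {

        /** Approximates how many messages were written to 'topic' between t1 and t2 (milliseconds). */
        static long writtenBetween(KafkaConsumer<?, ?> consumer, String topic, long t1, long t2) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo pi : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(topic, pi.partition()));
            }

            Map<TopicPartition, Long> atT1 = new HashMap<>();
            Map<TopicPartition, Long> atT2 = new HashMap<>();
            for (TopicPartition tp : partitions) {
                atT1.put(tp, t1);  // e.g. 12 hours ago
                atT2.put(tp, t2);  // e.g. now
            }

            Map<TopicPartition, OffsetAndTimestamp> offsetsT1 = consumer.offsetsForTimes(atT1);
            Map<TopicPartition, OffsetAndTimestamp> offsetsT2 = consumer.offsetsForTimes(atT2);
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            long written = 0;
            for (TopicPartition tp : partitions) {
                // If no record is newer than the timestamp, use the current end of the partition.
                long o1 = offsetsT1.get(tp) != null ? offsetsT1.get(tp).offset() : endOffsets.get(tp);
                long o2 = offsetsT2.get(tp) != null ? offsetsT2.get(tp).offset() : endOffsets.get(tp);
                written += o2 - o1;
            }
            return written;
        }
    }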

What is the frequency with which partition offsets are queried by driver using the direct Kafka API in Spark Streaming?

Are the offsets queried for every batch interval or at a different frequency?
When you use the term offsets, I'm assuming you mean the offset and not the actual message. Looking through the documentation, I was able to find two references to the direct approach.
The first one, from Apache Spark Docs
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system).
This makes it seem like these are independent actions: offsets are queried from Kafka and then assigned to be processed in a specific batch, and a single query can return offsets that cover multiple Spark batch jobs.
The second one, a blog post from databricks
Instead of receiving the data continuously using Receivers and storing it in a WAL, we simply decide at the beginning of every batch interval what is the range of offsets to consume. Later, when each batch’s jobs are executed, the data corresponding to the offset ranges is read from Kafka for processing (similar to how HDFS files are read).
This one makes it seem more like each batch interval itself fetches the range of offsets to consume, and the messages are only actually fetched from Kafka when the batch's jobs run.
I have never worked with Apache Spark (I mainly use Apache Storm + Kafka), but since the first doc suggests the two can happen at different intervals, I would assume they can, and the blog post just doesn't mention it because it doesn't get into the technical details.