I want to get some statistics from a Kafka topic:
total written messages
total written messages in the last 12 hours, last hour, ...
Can I safely assume that reading the offsets for each partition in a topic for a given timestamp (using getOffsetsByTimes) should give me the number of messages written in that specific time?
I can sum the offsets across all partitions and then calculate the difference between timestamp 1 and timestamp 2. With these data I should be able to calculate a lot of statistics.
Are there situations where these data could give me wrong results? I don't need 100% precision, but I expect a reliable solution. Of course I'm assuming that the topic is not deleted/reset.
Are there other alternatives that don't require third-party tools? (I cannot install other tools easily, and I need the data inside my app.)
(using getOffsetsByTimes) should give me the number of messages written in that specific time?
In Kafka: The Definitive Guide it mentions that getOffsetsByTime is not message-based but segment-file based, meaning the time-index lookup won't read into a segment file; rather, it returns the first segment containing the time you are interested in. (This may have changed in newer Kafka releases since the book was released.)
If you don't need accuracy, this should be fine. Do note that compacted topics don't have sequentially ordered offsets one after the other, so a simple abs(offset#time2 - offset#time1) won't quite work for "total existing messages in a topic".
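For what it's worth, here is a minimal sketch of that offset-delta idea using the Java consumer's offsetsForTimes() and endOffsets() (the helper name estimateMessagesBetween is mine, just for illustration). It carries the caveats above: the result is an offset delta, so on compacted or partially deleted topics it is an upper bound rather than an exact message count.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class TopicMessageEstimate {

    // Rough count of records appended to `topic` between ts1 and ts2 (epoch millis).
    // Precision is limited by the time index granularity, and compaction/deletion
    // make the offset delta an upper bound rather than an exact message count.
    static long estimateMessagesBetween(KafkaConsumer<?, ?> consumer,
                                        String topic, long ts1, long ts2) {
        List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                .collect(Collectors.toList());

        Map<TopicPartition, Long> atTs1 = new HashMap<>();
        Map<TopicPartition, Long> atTs2 = new HashMap<>();
        for (TopicPartition tp : partitions) {
            atTs1.put(tp, ts1);
            atTs2.put(tp, ts2);
        }

        Map<TopicPartition, OffsetAndTimestamp> o1 = consumer.offsetsForTimes(atTs1);
        Map<TopicPartition, OffsetAndTimestamp> o2 = consumer.offsetsForTimes(atTs2);
        // Partitions with no record at or after a timestamp return null; fall back to the end offset.
        Map<TopicPartition, Long> end = consumer.endOffsets(partitions);

        long total = 0;
        for (TopicPartition tp : partitions) {
            long start = o1.get(tp) != null ? o1.get(tp).offset() : end.get(tp);
            long stop = o2.get(tp) != null ? o2.get(tp).offset() : end.get(tp);
            total += Math.max(0, stop - start);
        }
        return total;
    }
}
```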
Otherwise, plenty of JMX metrics are exposed by the brokers like bytes-in and message rates, which you can aggregate and plot over time using Grafana, for example.
I have a situation where I'm loading data into Kafka. I would like to process the records in discrete 10-minute buckets. But bear in mind that the record timestamps come from the producers, so they may not arrive in perfectly the right order, which means I can't simply use the standard Kafka consumer approach, since that would put records outside of my discrete bucket.
Is it possible to use partitions for this? I could look at the timestamp of each record before placing it in the topic and use that to select the appropriate partition. But I don't know if Kafka supports ad hoc named partitions.
They aren't "named" partitions. Sure, you could define a topic with 6 partitions (10 minute "buckets", ignoring hours and days) and a Partitioner subclass that computes which partition the record timestamp will go into with a simple math function, however, this is really only useful for ordering and doesn't address that you need to consume from two partitions for every non-exact 10 minute interval. E.g. records at minute 11 (partition 1) would need to consume records with minute 1-9 (partition 0).
Overall, sounds like you want sliding/hopping windowing features of Kafka Streams, not the plain Consumer API. And this will work without writing custom Producer Partitioners with any number of partitions.
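If it helps, a rough sketch of a windowed count with Kafka Streams could look like this. The topic name sensor-events and the string serdes are assumptions, and TimeWindows.ofSizeAndGrace requires Kafka Streams 3.0+ (older versions use TimeWindows.of(...).grace(...)). The grace period is what absorbs the slightly out-of-order producer timestamps you mention.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TenMinuteBuckets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ten-minute-buckets");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> records = builder.stream("sensor-events",
                Consumed.with(Serdes.String(), Serdes.String()));

        records.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // 10-minute tumbling windows driven by record timestamps; the grace
               // period tolerates producers whose timestamps arrive slightly out of order.
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(10), Duration.ofMinutes(2)))
               .count()
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.printf("%s -> %d records%n", windowedKey, count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```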
Let's assume I have a (somewhat) high-velocity input topic, for example sensor.temperature, with a retention period of 1 day.
Multiple microservices are already consuming data from it. I am also backing up the events in a historical event store.
Now (as a simplified example) I have a new requirement: calculating the all-time maximum temperature per sensor.
This fits Kafka Streams very well, so I have prepared a new microservice that creates a KTable aggregating the temperature (with max) grouped per sensor.
Simply deploying this microservice would be enough if the input topic had infinite retention, but with 1-day retention the maximum would not be all-time, as our requirement demands.
I feel this could be a common scenario, but somehow I was not able to find a satisfying solution on the internet.
Maybe I am missing something, but my ideas for how to make it work do not feel great:
Replay all past events into the input topic sensor.temperature. This is a large amount of data, and it would cause all subscribing microservices to run excessive computation, which is most likely not acceptable.
Create a duplicate of the input topic for my microservice, sensor.temperature.local, into which I would always copy all events and then further process (aggregate) them from this local topic.
This way I can freely replay historical events into the local topic without affecting other microservices.
However, this local duplicate would be required for every Kafka Streams microservice, and if the input topic is high velocity this could be too much duplication.
Maybe there is some way to modify KTables more directly, so one could query the historical event store for the max value per sensor and put it into the KTable once?
But what if the streams topology is more complex? It would require orchestrating consistent state in every microservice's KTables, rather than simply replaying events.
How to design the solution?
Thanks in advance for your help!
In this case I would create a topic that stores the max periodically (so that it won't fall off the topic because of a cleanup). Then you could make your service report the max of the max-topic and the max of the measurement-topic.
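If I understand the suggestion correctly, a minimal Kafka Streams sketch of the checkpointing side could look like the following; the topic name sensor.temperature.max and the serdes are assumptions on my part.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class MaxTemperatureCheckpoint {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "max-temperature-checkpoint");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Running max per sensor, continuously written to sensor.temperature.max.
        // If that topic is compacted (cleanup.policy=compact), the latest max per
        // sensor never expires, even though the input topic only keeps 1 day of data.
        builder.stream("sensor.temperature",
                       Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               .reduce(Math::max, Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream()
               .to("sensor.temperature.max", Produced.with(Serdes.String(), Serdes.Double()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The reporting side would then take, per sensor, the larger of the latest value in sensor.temperature.max and whatever is still present in sensor.temperature, and the max topic could additionally be seeded once from the historical event store to cover data that has already expired.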
I have a question about how topic offsets work in Kafka: are they stored in a B-tree-like structure?
The specific reason I ask: let's say I have a topic with 10 million records, which means 10 million offsets if no compaction occurred or it is turned off. Now if I use consumer.seek(5000000), will it work like a linked list, i.e. go to offset 0 and try to hop from there to the 5,000,000th offset, or is there an index-like structure that tells it exactly where the 5,000,000th record is in the log?
Thanks for any answers!
Kafka records are stored sequentially in the logs. The exact format is well described in the documentation.
Kafka usually expects reads to be sequential, as consumers fetch records in order. However, when random access is required (via seek, or to restart from a specific position), Kafka uses index files to quickly find a record based on its offset.
A Kafka log is made of several segments. Each segment has an index and a timeindex file associated with it, which map offsets and timestamps to file positions. The frequency at which entries are added to the indexes can be configured using index.interval.bytes. Using these files, Kafka is able to seek immediately to a nearby position and avoid re-reading all messages.
You may have noticed that after an unclean shutdown Kafka rebuilds indexes for a few minutes. It's these indexes, used for file-position lookups, that are being rebuilt.
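For illustration, this is what such a random access looks like from the client side (topic name my-topic and the partition/offset are arbitrary); the broker resolves the seek through the segment indexes rather than scanning the log from the beginning.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "seek-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(tp));

            // Jump straight to offset 5,000,000: the broker resolves this through the
            // segment's index file rather than scanning the log from offset 0.
            consumer.seek(tp, 5_000_000L);
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r ->
                    System.out.printf("offset=%d timestamp=%d%n", r.offset(), r.timestamp()));
        }
    }
}
```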
From an abstract point of view, Apache Kafka stores data in topics. This data can be read by a consumer.
I'd like to have a (monitor) consumer that finds data of a certain age. The monitor should send a warning to subsystems that records are still unread and will be discarded by Kafka once they reach the retention time.
I couldn't find a suitable way until now.
You can use KafkaConsumer.offsetsForTimes() to map timestamps to offsets.
For example, if you call it with the date of yesterday and it returns offset X, then any messages with an offset smaller than X are older than yesterday.
Then your logic can figure out from the current positions of your consumers if you are at risk of having unprocessed records discarded.
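To make that concrete, here is a rough sketch of such a check; the method name and the cutoff handling are mine, and it assumes the monitoring consumer is configured with the group.id of the consumer group being watched, so that committed() returns that group's offsets.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RetentionRiskMonitor {

    // Warn if the group's committed offset is still behind the first offset written
    // after `cutoff` (e.g. "now minus retention plus a safety margin").
    static void checkPartition(KafkaConsumer<?, ?> monitor, TopicPartition tp, Instant cutoff) {
        Map<TopicPartition, OffsetAndTimestamp> byTime =
                monitor.offsetsForTimes(Collections.singletonMap(tp, cutoff.toEpochMilli()));
        OffsetAndTimestamp boundary = byTime.get(tp);
        if (boundary == null) {
            return; // no records at or after the cutoff
        }
        // Offsets smaller than boundary.offset() belong to records older than the cutoff.
        OffsetAndMetadata committed = monitor.committed(tp);
        if (committed == null || committed.offset() < boundary.offset()) {
            System.out.printf("WARNING: %s still has unread records older than %s%n", tp, cutoff);
        }
    }
}
```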
Note that there is currently a KIP under discussion to expose metrics to track that: https://cwiki.apache.org/confluence/display/KAFKA/KIP-223+-+Add+per-topic+min+lead+and+per-partition+lead+metrics+to+KafkaConsumer
http://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-
Are the offsets queried for every batch interval or at a different frequency?
When you use the term offsets, I'm assuming you mean the offset and not the actual message. Looking through the documentation, I was able to find two references to the direct approach.
The first one, from the Apache Spark docs:
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system).
This makes it seem like these are independent actions: offsets are queried from Kafka and then assigned to be processed in a specific batch, and querying offsets from Kafka can return offsets that cover multiple Spark batch jobs.
The second one, a blog post from Databricks:
Instead of receiving the data continuously using Receivers and storing it in a WAL, we simply decide at the beginning of every batch interval what is the range of offsets to consume. Later, when each batch’s jobs are executed, the data corresponding to the offset ranges is read from Kafka for processing (similar to how HDFS files are read).
This one makes it seem more like each batch interval itself fetches a range of offsets to consume, and only when the batch runs does it actually fetch those messages from Kafka.
I have never worked with Apache Spark; I mainly use Apache Storm + Kafka. But since the first doc suggests they can happen at different intervals, I would assume they can, and the blog post just doesn't mention it because it doesn't get into those technical details.
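Not Spark's internal code, but for reference, this is the kind of query the direct approach performs against Kafka with the plain Java consumer: fetching the latest offset of every partition so an offset range for a batch can be bounded.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LatestOffsets {

    // Ask Kafka for the current end offset of every partition of a topic; offset
    // ranges for a batch are bounded by values like these.
    static Map<TopicPartition, Long> latestOffsets(KafkaConsumer<?, ?> consumer, String topic) {
        List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                .collect(Collectors.toList());
        return consumer.endOffsets(partitions);
    }
}
```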