How can I know that a kafka topic is full? - apache-kafka

Let's say I have one Kafka broker configured with one partition and:
log.retention.bytes=80000
log.retention.hours=6
What will happen if I try to send a record with the producer API to the broker and the log of the topic gets full before the retention period elapses?
Will my message get dropped?
Or will Kafka free some space from the old messages and add mine?
How can I know if a topic is getting full and logs are being deleted before being consumed?
Is there a way to monitor or expose a metric when a topic is getting full?

What will happen if I try to send a record with the producer API to a broker and the log of the topic gets full before the retention period? Will my message get dropped? Or will Kafka free some space from the old messages and add mine?
The cleanup.policy topic configuration, which defaults to delete, says that "the delete policy will discard old segments when their retention time or size limit has been reached."
So if you send a record with the producer API and the topic is full, Kafka will discard old segments rather than drop your message.
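To check which policy a topic is actually running with, you could read its effective configuration programmatically. Below is a minimal sketch using the kafka-python client (the library choice, broker address, and topic name are assumptions, and the response tuple layout varies slightly across protocol versions):

from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

# Broker address and topic name are placeholders.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
resource = ConfigResource(ConfigResourceType.TOPIC, "my-topic")

responses = admin.describe_configs(config_resources=[resource])
if not isinstance(responses, list):  # older kafka-python returns a single response
    responses = [responses]

# Each resource tuple is (error_code, error_message, type, name, config_entries);
# each config entry starts with (config_name, config_value, ...).
for response in responses:
    for _, _, _, topic, entries in response.resources:
        for entry in entries:
            if entry[0] in ("cleanup.policy", "retention.ms", "retention.bytes"):
                print(f"{topic}: {entry[0]} = {entry[1]}")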
How can I know if a topic is getting full and logs are being deleted
before being consumed?
Is there a way to monitor or expose a metric when a topic is getting full?
You can get the partition size using the script below:
bin/kafka-log-dirs.sh --describe --bootstrap-server <broker>:<port> --topic-list <topic>
You will need to develop a script that runs the command above to fetch the current size of the topic and sends it periodically to Datadog.
In Datadog, you can create a monitor that triggers an appropriate action (e.g. sending email alerts) once the size reaches a particular threshold.
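A sketch of such a script in Python (the Kafka installation path, broker address, topic name, and threshold are placeholders; the Datadog call is left as a stub):

import json
import subprocess

KAFKA_LOG_DIRS = "/opt/kafka/bin/kafka-log-dirs.sh"  # placeholder path
BOOTSTRAP = "localhost:9092"
TOPIC = "my-topic"
THRESHOLD_BYTES = 80000

out = subprocess.check_output(
    [KAFKA_LOG_DIRS, "--describe",
     "--bootstrap-server", BOOTSTRAP,
     "--topic-list", TOPIC],
    text=True,
)

# The tool prints a couple of header lines followed by one JSON document.
payload = next(line for line in out.splitlines() if line.startswith("{"))
data = json.loads(payload)

# Sum the on-disk size of every partition of the topic across all brokers.
total = sum(
    partition["size"]
    for broker in data["brokers"]
    for log_dir in broker["logDirs"]
    for partition in log_dir["partitions"]
)

print(f"{TOPIC} currently occupies {total} bytes")
if total >= THRESHOLD_BYTES:
    # Replace with a real alert (Datadog API call, email, ...).
    print("ALERT: topic size reached the configured threshold")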

That's not exactly true: a topic is never full, at least by default.
I said "by default" because, as @Mukesh said, the cleanup.policy will discard old segments when their retention time or size limit is reached, but by default there is no size limit, only a time limit. The property that handles the size limit is retention.bytes (set to -1 by default).
That leaves only a time limit on messages. Note that retention.bytes is set per partition, so to get the limit for a whole topic you have to multiply it by the number of partitions in that topic.
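If you do want a size cap, retention.bytes can be set per topic. A minimal sketch using the kafka-python admin client (the library choice, broker address, and topic name are assumptions):

from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

# Broker address and topic name are placeholders.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Cap each partition of "my-topic" at 80000 bytes; with 4 partitions the
# topic as a whole is then bounded by roughly 4 * 80000 = 320000 bytes.
admin.alter_configs(config_resources=[
    ConfigResource(
        ConfigResourceType.TOPIC,
        "my-topic",
        configs={"retention.bytes": "80000"},
    ),
])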
EDIT:
There are tons of metrics that Kafka exports (via JMX), and among those you can find global and per-topic metrics about segments (total number, size, rate of segment rolls, etc.); for example, the per-partition log size is exposed through the kafka.log:type=Log,name=Size MBean.

Related

kafka automatic deletion based on message consumption

Like what we have in MQ solutions, is it possible to have a message automatically deleted in Kafka once it is consumed?
As I don't have control over when the message will be consumed, it's not possible to define retention by time / byte size.
You can override the time-based retention configuration on a per-topic basis, and even set it to -1 for no deletion at all. Size-based retention is unlimited by default, and you don't have to use it. That said, I am not sure Kafka is best suited for your use case, as it is meant to be used for real-time, high-performance streaming processes. On another note, you can use a compacted topic and send a tombstone message to delete a record once it is processed, but fundamentally Kafka does not have automatic deletion on consumption.

Use case: To construct a message that disappears after some time in Kafka

I have a scenario where I want to send a message to an alert service that would process the message and send it to HipChat.
But I want the message to be active for only a minute. If HipChat is down (hypothetically), then the message should not be sent to HipChat.
I am using Kafka, so one of the services sends the message to Kafka; the message is then consumed by the alert service (a Kafka consumer that polls), which while processing checks that the difference between the current time and the message's timestamp is not greater than one minute. If it isn't, it sends the message to HipChat asynchronously.
Enhancement:
I want a way to construct a self-destructing message so that it automatically disappears after one minute. Is there a way to do it with Kafka? Or is there a better alternative than Kafka (Flink/SQS)? If yes, how?
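For reference, the age check described above might look like the following minimal sketch, using the kafka-python client (the library choice, broker address, and topic/group names are assumptions):

import time

from kafka import KafkaConsumer

MAX_AGE_MS = 60_000  # one minute

def send_to_hipchat(payload):
    """Stand-in for the real (asynchronous) HipChat call."""
    print("forwarding:", payload)

consumer = KafkaConsumer(
    "alerts",                            # hypothetical topic name
    bootstrap_servers="localhost:9092",  # placeholder address
    group_id="alert-service",
)

for message in consumer:
    # message.timestamp is the record timestamp in milliseconds.
    age_ms = time.time() * 1000 - message.timestamp
    if age_ms > MAX_AGE_MS:
        continue  # too old: drop it instead of forwarding
    send_to_hipchat(message.value)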
You can make use of the Kafka topic configurations retention.ms and delete.retention.ms, as described in the topic-level configs documentation.
retention.ms should be set to 1 minute (60000 ms) and delete.retention.ms should be set to 0 in your case. That way, messages will stay in the Kafka topic for one minute before they get deleted. However, that also means you might lose messages if your consumer takes more than one minute to consume all messages (especially when reading a topic from the beginning).
Details on those configurations:
delete.retention.ms: The amount of time to retain delete tombstone markers for log compacted topics. This setting also gives a bound on the time in which a consumer must complete a read if they begin from offset 0 to ensure that they get a valid snapshot of the final state (otherwise delete tombstones may be collected before they complete their scan).
retention.ms: This configuration controls the maximum time we will retain a log before we will discard old log segments to free up space if we are using the "delete" retention policy. This represents an SLA on how soon consumers must read their data. If set to -1, no time limit is applied.
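Applied to this question, creating the topic with those settings might look like the sketch below (kafka-python admin client; the library choice and names are assumptions). One caveat worth knowing: retention is enforced per log segment and only checked periodically, so a small segment.ms is needed for messages to actually disappear close to the one-minute mark.

from kafka.admin import KafkaAdminClient, NewTopic

# Broker address and topic name are placeholders.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

admin.create_topics([
    NewTopic(
        name="alerts",
        num_partitions=1,
        replication_factor=1,
        topic_configs={
            "retention.ms": "60000",     # keep messages for ~1 minute
            "delete.retention.ms": "0",  # as suggested in the answer above
            "segment.ms": "60000",       # roll segments often enough that
        },                               # retention can actually delete them
    ),
])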

Kafka: Can we limit a number of messages per key?

Suppose I want to retain no more than the 100 most recent messages per key in a Kafka topic. Can I achieve this policy somehow? For example, can I configure the compaction policy to store the N most recent messages per key (not only one per key)?
There is no such mechanism that Kafka provides. Kafka doesn't remove records based on how many recent messages exist per key; deletion depends on the time lapse and the size of a log file, and compaction keeps only the latest record per key.

Kafka retention AFTER initial consuming

I have a Kafka cluster with one consumer, which is processing TBs of data every day. Once a message is consumed and committed, it can be deleted immediately (or after a retention of a few minutes).
It looks like the log.retention.bytes and log.retention.hours configurations count from the message's creation time, which is not good for me.
In cases where the consumer is down for maintenance or an incident, I want to keep the data until it comes back online. If I happen to run out of space, I want to refuse to accept new data from the producers, and NOT delete data that wasn't consumed yet (so log.retention.bytes doesn't help me).
Any ideas?
If you can ensure your messages have unique keys, you can configure your topic to use compaction instead of a timed-retention policy. Then have your consumer, after processing each message, send a message back to the same topic with the same key but a null value. Kafka will compact away such messages. You can tune the compaction parameters to your needs, as well as the log segment file size: since the head segment is never compacted, you may want to set it smaller if you want compaction to kick in sooner.
However, as mentioned, this only works if messages have unique keys; otherwise you can't simply turn on compaction, as that would cause the loss of previous messages with the same key during periods when your consumer is down (or has fallen behind the head segment).
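A minimal sketch of that pattern with the kafka-python client (the library choice and all names are assumptions; the topic is presumed to be created with cleanup.policy=compact and unique keys):

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "events"  # hypothetical compacted topic

def process(payload):
    """Stand-in for the real message processing."""
    print("processing:", payload)

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",  # placeholder address
    group_id="worker",
    enable_auto_commit=False,
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    if message.value is None:
        continue  # skip tombstones written by this service

    process(message.value)

    # Write a tombstone for the same key; compaction will eventually remove
    # both the original record and (after delete.retention.ms) the tombstone.
    producer.send(TOPIC, key=message.key, value=None)
    consumer.commit()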

Kafka optimal retention and deletion policy

I am fairly new to Kafka, so forgive me if this question is trivial. I have a very simple setup for the purposes of timing tests, as follows:
Machine A -> writes to topic 1 (Broker) -> Machine B reads from topic 1
Machine B -> writes message just read to topic 2 (Broker) -> Machine A reads from topic 2
Now I am sending messages of roughly 1400 bytes in an infinite loop, filling up the space on my small broker very quickly. I'm experimenting with different values for log.retention.ms, log.retention.bytes, log.segment.bytes, and log.segment.delete.delay.ms. First I set all of the values to the minimum allowed, but that seemed to degrade performance; then I set them to the maximum my broker could take before being completely full, but again performance degraded when a deletion occurred. Is there a best practice for setting these values to get the absolute minimum delay?
Thanks for the help!
Apache Kafka uses a log data structure to manage its messages. A log is basically an ordered set of segments, and a segment is a collection of messages. Apache Kafka provides retention at the segment level instead of at the message level; hence, Kafka keeps removing segments from the tail of the log as they violate the retention policies.
Apache Kafka provides us with the following retention policies:
Time Based Retention
Under this policy, we configure the maximum time a segment (and hence its messages) can live. Once a segment has exceeded the configured retention time, it is marked for deletion or compaction, depending on the configured cleanup policy. The default retention time for segments is 7 days.
Here are the parameters (in decreasing order of priority) that you can set in your Kafka broker properties file:
# Configures retention time in milliseconds
log.retention.ms=1680000
# Used if log.retention.ms is not set
log.retention.minutes=1680
# Used if log.retention.minutes is not set
log.retention.hours=168
Size Based Retention
Under this policy, we configure the maximum size of the log for a topic partition. Once the log reaches this size, Kafka starts removing segments from its tail. This policy is not popular, as it does not provide good visibility into message expiry. However, it can come in handy in scenarios where we need to control the size of a log due to limited disk space.
Here are the parameters that you can set in your Kafka broker properties file:
# Configures the maximum size of a log
log.retention.bytes=104857600
So, according to your use case, you should configure log.retention.bytes so that your disk does not get full.
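Since log.retention.bytes applies per partition, a quick back-of-the-envelope calculation can help pick a safe value. A rough sketch (all numbers are made-up placeholders; the one-extra-segment headroom reflects that retention only removes whole closed segments, never the active one):

# Rough per-broker disk budget for size-based retention.
# log.retention.bytes applies to EACH partition, so a broker hosting many
# partitions needs headroom for all of them plus one extra segment each.

retention_bytes = 104857600      # log.retention.bytes (100 MiB, placeholder)
segment_bytes = 104857600        # log.segment.bytes (placeholder; default is 1 GiB)
partitions_on_broker = 12        # made-up partition count

worst_case = partitions_on_broker * (retention_bytes + segment_bytes)
print(f"budget at least {worst_case / 1024 ** 2:.0f} MiB of disk for these logs")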