How to force log compaction of a Kafka topic? - apache-kafka

Using Kafka 2.7.0 (in K8s), I create a test topic with cleanup.policy=compact:
./kafka-topics.sh --create --bootstrap-server kafka.core-kafka.svc.cluster.local:9092 --topic _test_quick_compaction_2021_12_02 --partitions 1 --replication-factor 3 --config cleanup.policy=compact
Write some messages to it:
kafkacat -b kafka.core-kafka.svc.cluster.local:9092 -P -t _test_quick_compaction_2021_12_02 -K:
1:a
2:b
3:c
1:d
2:e
Change the topic settings in a way such that compaction should kick in after 10 seconds:
./kafka-topics.sh --alter --zookeeper zookeeper.core-kafka.svc.cluster.local --topic _test_quick_compaction_2021_12_02 --config max.compaction.lag.ms=10000 --config min.cleanable.dirty.ratio=0.0 --config segment.ms=10000 --config delete.retention.ms=10000
Wait a minute, just to be sure:
sleep 60
Check the topic content:
kafkacat -C -e -o beginning -b kafka.core-kafka.svc.cluster.local:9092 -t _test_quick_compaction_2021_12_02 -K:
And to my surprise, the content is still
1:a
2:b
3:c
1:d
2:e
instead of the
3:c
1:d
2:e
which I expected.
Why is the topic not compacted, and what can I do to force it?

Since active segments are not eligible for compaction, the trick was to again write something to the topic to force the creation of a new segment.
# Create a test topic.
./kafka-topics.sh --create --bootstrap-server kafka.core-kafka.svc.cluster.local:9092 --topic _test_quick_compaction_2021_12_02 --partitions 1 --replication-factor 3 --config cleanup.policy=compact
# Write some messages to it.
echo "1:a\n2:b\n3:c" | kafkacat -b kafka.core-kafka.svc.cluster.local:9092 -P -t _test_quick_compaction_2021_12_02 -K:
# Check the topic content.
kafkacat -C -e -o beginning -b kafka.core-kafka.svc.cluster.local:9092 -t _test_quick_compaction_2021_12_02 -K:
# Change the topic settings in a way such that compaction should kick in after 10 seconds.
./kafka-topics.sh --alter --zookeeper zookeeper.core-kafka.svc.cluster.local --topic _test_quick_compaction_2021_12_02 --config max.compaction.lag.ms=10000 --config min.cleanable.dirty.ratio=0.0 --config segment.ms=10000 --config delete.retention.ms=10000
# Wait for the last segment to outdate
sleep 11
# Write new messages.
echo "1:d\n2:e" | kafkacat -b kafka.core-kafka.svc.cluster.local:9092 -P -t _test_quick_compaction_2021_12_02 -K:
# Check the topic content.
kafkacat -C -e -o beginning -b kafka.core-kafka.svc.cluster.local:9092 -t _test_quick_compaction_2021_12_02 -K:
# Wait for this segment to outdate.
sleep 11
# Write new messages again.
echo "1:d\n2:e" | kafkacat -b kafka.core-kafka.svc.cluster.local:9092 -P -t _test_quick_compaction_2021_12_02 -K:
# Check the topic content.
kafkacat -C -e -o beginning -b kafka.core-kafka.svc.cluster.local:9092 -t _test_quick_compaction_2021_12_02 -K:
# Wait for compaction to happen.
sleep 11
# Check the topic content to validate that it has been compacted.
kafkacat -C -e -o beginning -b kafka.core-kafka.svc.cluster.local:9092 -t _test_quick_compaction_2021_12_02 -K:
# Revert the setting changes.
./kafka-topics.sh --alter --zookeeper zookeeper.core-kafka.svc.cluster.local --topic _test_quick_compaction_2021_12_02 --delete-config max.compaction.lag.ms --delete-config min.cleanable.dirty.ratio --delete-config segment.ms --delete-config delete.retention.ms
# Delete the topic
# /home/th/kafka_2.13-2.7.0/bin/kafka-topics.sh --delete --bootstrap-server kafka.core-kafka.svc.cluster.local:9092 --topic _test_quick_compaction_2021_12_02

Related

Kafkacat Produce message from a file with headers

I need to produce batch messages to Kafka so I have a file that I feed kafkacat:
kafkacat -b localhost:9092 -t <my_topic> -T -P -l /tmp/msgs
The content of /tmp/msgs is as follows
-H "id=1"
{"key" : "value0"}
-H "id=2"
{"key" : "value1"}
When I run the kafkacat command above, it inserts four messages to kafka - one message per line in /tmp/msgs.
I need to instruct kafkacat to parse the file correctly - that is -H "id=1" is the header for the message {"key" = "value0"}.
How do I achieve this?
Thanks
You need to pass the headers as follows.
kcat -b localhost:9092 -t topic-name -P -H key1=value1 -H key2=value2 /temp/payload.json

Is there a way to add headers in kafka-console-producer.sh

I'd like to use the kafka-console-producer.sh to fire a few JSON messages with Kafka headers.
Is this possible?
docker exec -it kafka_1 /opt/kafka_2.12-2.3.0/bin/kafka-console-producer.sh --broker-list localhost:9093 --topic my-topic --producer.config /opt/kafka_2.12-2.3.0/config/my-custom.properties
No, but you can with kafkacat's -H argument:
Produce:
echo '{"col_foo":1}'|kafkacat -b localhost:9092 -t test -P -H foo=bar
Consume:
kafkacat -b localhost:9092 -t test -C -f '-----\nTopic %t[%p]\nOffset: %o\nHeaders: %h\nKey: %k\nPayload (%S bytes): %s\n'
-----
Topic test[0]
Offset: 0
Headers: foo=bar
Key:
Payload (9 bytes): col_foo:1
% Reached end of topic test [0] at offset 1
Starting from kafka 3.1.0 there is an option to turn on headers parsing parse.headers=true and then you just place them before your record value, info form docs:
| parse.headers=true:
| "h1:v1,h2:v2...\tvalue"
So your command will look like
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic topic_name --property parse.headers=true
and then you pass
header_name:header_value\nrecord_value

how to view kafka headers

We are sending message with headers to Kafka using
org.apache.kafka.clients.producer.ProducerRecord
public ProducerRecord(String topic, Integer partition, K key, V value, Iterable<Header> headers) {
this(topic, partition, (Long)null, key, value, headers);
}
How can I actually see these headers using command. kafka-console-consumer.sh only shows me payload and no headers.
You can use the excellent kafkacat tool.
Sample command:
kafkacat -b kafka-broker:9092 -t my_topic_name -C \
-f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o
Headers: %h\n'
Sample output:
Key (-1 bytes):
Value (13 bytes): {foo:"bar 5"}
Timestamp: 1548350164096
Partition: 0
Offset: 34
Headers: __connect.errors.topic=test_topic_json,__connect.errors.partition=0,__connect.errors.offset=94,__connect.errors.connector.name=file_sink_03,__connect.errors.task.id=0,__connect.errors.stage=VALU
E_CONVERTER,__connect.errors.class.name=org.apache.kafka.connect.json.JsonConverter,__connect.errors.exception.class.name=org.apache.kafka.connect.errors.DataException,__connect.errors.exception.message=Co
nverting byte[] to Kafka Connect data failed due to serialization error: ,__connect.errors.exception.stacktrace=org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed
due to serialization error:
The kafkacat header option is only available in recent builds of kafkacat; you may want to build from master branch yourself if your current version doesn't include it.
You can also run kafkacat from Docker:
docker run --rm edenhill/kafkacat:1.5.0 \
-b kafka-broker:9092 \
-t my_topic_name -C \
-f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o
Headers: %h\n'
If you use Docker bear in mind the network implications of how to reach the Kafka broker.
Starting with kafka-2.7.0 you can enable printing headers in console-consumer by providing property print.headers=true
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic quickstart-events --property print.key=true --property print.headers=true --property print.timestamp=true
You can also use kafkactl for this. E.g. with output as yaml:
kafkactl consume my-topic --print-headers -o yaml
Sample output:
partition: 1
offset: 22
headers:
key1: value1
key2: value2
value: my-value
Disclaimer: I am contributor to this project
From kafka-console-consumer.sh script:
exec $(dirname $0)/kafka-run-class.sh kafka.tools.ConsoleConsumer "$#"
src: https://github.com/apache/kafka/blob/2.1.1/bin/kafka-console-consumer.sh
In kafka.tools.ConsoleConsumer the header is provided to the Formatter, but none of the existing Formatters makes use of it:
formatter.writeTo(new ConsumerRecord(msg.topic, msg.partition, msg.offset, msg.timestamp,
msg.timestampType, 0, 0, 0, msg.key, msg.value, msg.headers),
output)
src: https://github.com/apache/kafka/blob/2.1.1/core/src/main/scala/kafka/tools/ConsoleConsumer.scala
At the bottom of the above link you can see existing Formatters.
If you want to print headers you need to implement your own kafka.common.MessageFormatter and in particular its write method:
def writeTo(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]], output: PrintStream): Unit
and then run your console consumer with --formatter providing your own formatter (it should also be present on the classpath).
Another, simpler and faster way, would be to implement your own mini-program using KafkaConsumer and check headers in debug.
kcat -C -b $brokers -t $topic -f 'key: %k Headers: %h: Message value: %s\n'

Kafka producer to read data files

I am trying to load a data file in loop(to check stats) instead of standard input in Kafka. After downloading Kafka, I performed the following steps:
Started zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
Started Server:
bin/kafka-server-start.sh config/server.properties
Created a topic named "test":
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Ran the Producer:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Test1
Test2
Listened by the Consumer:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Test1
Test2
Instead of Standard input, I want to pass a data file to the Producer which can be seen directly by the Consumer. Or is there any kafka producer instead of console consumer using which I can read data files. Any help would really be appreciated. Thanks!
You can read data file via cat and pipeline it to kafka-console-producer.sh.
cat ${datafile} | ${kafka_home}/bin/kafka-console-producer.sh --broker-list ${brokerlist} --topic test
If there is always a single file, you can just use tail command and then pipeline it to kafka console producer.
But if a new file will be created when some conditions met, you may need use apache.commons.io.monitor to monitor new file created, then repeat above.
Kafka has this built-in File Stream Connector, for piping the content of a file to producer(file source), or directing file content to another destination(file sink).
We have bin/connect-standalone.sh to read from file which can be configured in config/connect-file-source.properties and config/connect-standalone.properties.
So the command will be:
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
The easiest way if you are using Linux or Mac is:
kafka-console-producer --broker-list localhost:9092 --topic test < messages.txt
Reference:
https://github.com/Landoop/kafka-cheat-sheet
You can probably try the kafkacat utility as well.
The readme on Github provides examples
It would be great if you could share which tool worked the best for you :)
Details from KafkaCat Readme:
Read messages from stdin, produce to 'syslog' topic with snappy compression
$ tail -f /var/log/syslog | kafkacat -b mybroker -t syslog -z snappy
kafka-console-produce.sh \
--broker-list localhost:9092 \
--topic my_topic \
--new-producer < my_file.txt
Follow this link: http://grokbase.com/t/kafka/users/157b71babg/kafka-producer-input-file
Below command is ofcourse the easiest way to do that.
kafka-console-producer --broker-list localhost:9092 --topic test < message.txt
But sometimes it is not able to find the file.
example :
C:\kafka_2.11-2.4.0\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic jason-input < C:\data\message.txt
you given the actual path but it is not able to find C at the current location so it will give the error : file not found. We would be thinking that we have given the actual path so it will go to root and it will start the path from there but it is finding the C(root) at the current place.
Solution for that is to give the ..\ into the path to move to the parent folder.
for example.
you are executing the command like
C:\kafka_2.11-2.4.0\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic jason-input < ..\..\..\data\message.txt
as of now i am into the windows folder. ..\ will move the current directory to bin folder and again ..\ will move the current directory to the kafka.... folder and again ..\ will move to the C:. so now my path starts. data and then message.txt

What happens after modifying Kafka topic replication factor?

According to the docs, you just modify a topic to increase the replication factor.
> bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name
--partitions 20 --replication-factor 3 --config x=y
Unfortunately, it doesn't specify what happens then after you modify the topic. Do existing log segments get replicated to the new replicas, or only new messages?
For kafka 10 you need to invoke the kafka-reassign-partitions.sh script to change the replication factor of the topic.
Here is a demo of a script that will display the topic before and after the change:
updateTopicReplication() {
TOPIC_NAME=$1
REPLICAS=$2
echo "****************************************"
echo "describe $TOPIC_NAME"
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic $TOPIC_NAME
JSON_FILE=./configChanges/${TOPIC_NAME}-update.json
echo $JSON_FILE
[ -e $JSON_FILE ] && rm $JSON_FILE
touch $JSON_FILE
echo -e "{\"version\":1, \"partitions\":[{\"topic\":\"${TOPIC_NAME}\",\"partition\":0,\"replicas\":[${REPLICAS}]}]}" >> $JSON_FILE
echo "****************************************"
echo "updateing $TOPIC_NAME"
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file $JSON_FILE --execute
echo "****************************************"
echo "describe $TOPIC_NAME"
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic $TOPIC_NAME
}
It seems that kafka actually doesn't support increasing (or decreasing) the replication factor for a topic, according to the same docs I mentioned in the question.