Sending messages to Kafka unbuffered using kafkacat

I have a single-node Kafka instance running locally via docker-compose.
(system: Mac/Arm64, image: wurstmeister/kafka:2.13-2.6.0)
I want to use kafkacat (kcat installed via Homebrew) to instantly produce and consume messages to and from Kafka.
Here is a minimal script:
#!/usr/bin/env bash
NUM_MESSAGES=${1:-3} # use arg1 or use default=3
KCAT_ARGS="-q -u -c $NUM_MESSAGES -b localhost:9092 -t unbuffered"
log() { echo "$*" 1>&2; }
producer() {
  log "starting producer"
  for i in $(seq 1 "$NUM_MESSAGES"); do
    echo "msg $i"
    log "produced: msg $i"
    sleep 1
  done | kcat $KCAT_ARGS -P
}
consumer() {
  log "starting consumer"
  kcat $KCAT_ARGS -C -o end | while read -r line; do
    log "consumed: $line"
  done
}
producer &
consumer &
wait
I would expect (roughly) the following output:
starting producer
starting consumer
produced: msg 1
consumed: msg 1
produced: msg 2
consumed: msg 2
produced: msg 3
consumed: msg 3
However, the produced and consumed messages show up fully batched into two groups, even though the consumer and the producer are running in parallel:
starting producer
starting consumer
produced: msg 1
produced: msg 2
produced: msg 3
consumed: msg 1
consumed: msg 2
consumed: msg 3
Here are some kafkacat options and Kafka producer properties, with the values I already tried in order to change the producer behavior.
# kcat options having no effect on the test case
-u # unbuffered output
-T # act like `tee` and echo input
# kafka properties having no effect on the test case
-X queue.buffering.max.messages=1
-X queue.buffering.max.kbytes=1
-X batch.num.messages=1
-X queue.buffering.max.ms=100
-X socket.timeout.ms=100
-X max.in.flight.requests.per.connection=1
-X auto.commit.interval.ms=100
-X request.timeout.ms=100
-X message.timeout.ms=100
-X offset.store.sync.interval.ms=1
-X message.copy.max.bytes=100
-X socket.send.buffer.bytes=100
-X linger.ms=1
-X delivery.timeout.ms=100
None of the options above had any effect on the pipeline.
What am I missing?
Edit: It seems to be a flushing issue with either kcat or librdkafka. Maybe the -X properties are not used correctly.
Here are the current observations (will edit them as I learn more):
When sending a larger payload of 10000 messages with a smaller delay in the script, kcat will produce several batches of messages. It seems to be size-based, but not configurable by any of the -X options.
The batches are then also correctly picked up by the consumer. So it must be a producer issue.
I also tried the script in Docker with the current kafkacat from the Alpine repos. That one seems to flush a bit earlier, i.e. with less data needed to fill the "hidden" buffer. The -X options also had no effect.
The -X properties do seem to be validated: if I set out-of-range values, kcat (or maybe librdkafka) will complain. However, setting low values for any of the timeout and buffer-size properties has no effect.
When calling kcat once per message (which is a bit of an overkill), the messages are produced instantly.
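For illustration, the per-message variant looks roughly like this; a minimal sketch reusing NUM_MESSAGES, KCAT_ARGS, and log from the script above (one kcat process per message, so every pipe close forces a flush):
# one-shot producer: a fresh kcat process per message, flushed when each pipe closes
for i in $(seq 1 "$NUM_MESSAGES"); do
  echo "msg $i" | kcat $KCAT_ARGS -P
  log "produced: msg $i"
  sleep 1
done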
The question remains:
How do I tell a Kafka pipeline to produce my first message instantly?
If you have an example in Go, this would also help, since I am having similar observations with a small Go program using kafka-go. I may post a separate question if I can strip that down to a postable format.
UPDATE: I tried using a bitnami image on a pure Linux host. Producing and consuming via kafkacat works as expected on this system. I will post an answer once I know more.

Here is how I solved the problem.
The issue was not in the Kafka docker images.
They all work as expected, although I was able to crash the Java-based Kafkas just by firing up kcat against them. I later added Redpanda (a non-Java "Kafka", driven via its rpk CLI), which was much more stable in my single-node setup.
Findings
Using kcat I did not find any way of producing messages instantly, without buffering. It notoriously ignores all -X args (edenhill/kcat version 1.7.0, macOS, Arm64).
Sending single messages works: when the input pipe is closed, kcat flushes its output buffer.
Consuming messages instantly via kcat is possible and works by default (a one-shot round trip illustrating both points is sketched after these findings).
Other Kafka clients do not have this issue. I created a small kafka-go example that just works as expected; no extensive buffering by default.
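To illustrate the two findings above, here is a minimal one-shot round trip; the broker address and topic name are just the ones from my test setup, adjust as needed:
# produce a single record; kcat flushes as soon as the pipe closes
echo "ping $(date +%s)" | kcat -q -b localhost:9092 -t unbuffered -P
# read back the last record and exit (-o -1: start at the last offset, -c 1: one message, -e: exit at end)
kcat -q -b localhost:9092 -t unbuffered -C -o -1 -c 1 -e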
Conclusion
Do not use kcat to produce messages via long-running pipes.
Use kafka-go or a similar client even for small health checks and other "scripts".

Related

kcat protobuf deserialization

I'm using kcat to check the content of Kafka topics when working locally but, when messages are serialized with protobuf, the result I get is an unreadable stream of encoded characters. I'm aware of the existence of some other Kafka consumer tools (Kafdrop, AKHQ, Kowl, Kadeck...) but I'm looking for the simplest option which fits my needs.
Does kcat support protobuf key/value deserialization from a .proto file?
Is there any simple terminal-based tool which allows this?
I've had luck with this command:
kcat -C -t <topic> -b <kafkahost>:9092 -o -1 -e -q -D "" | protoc --decode=<full message class> path/to/my.proto --proto_path <proto_parent_folder>
any simple terminal-based tool which allows this
Only ones that integrate with the Confluent Schema Registry (which is what the tools you linked use as well); e.g. kafka-protobuf-console-consumer is already part of Confluent Platform.
Regarding kcat, refer to https://github.com/edenhill/kcat/issues/72 and the linked issues.
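For reference, a rough sketch of how kafka-protobuf-console-consumer is typically invoked; the broker and Schema Registry addresses below are placeholders for your setup, and this only works for messages that were produced with the Schema Registry protobuf serializer:
kafka-protobuf-console-consumer --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning \
  --property schema.registry.url=http://localhost:8081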

Mongodb Kafka messages not seen by topic

I noticed that my topic, despite running and operating, doesn't register events occurring in my MongoDB.
Every time I insert/modify a record, I no longer get any output from the kafka-console-consumer command.
Is there maybe a way to clear Kafka's cache/offsets?
Source and sink connections are up and running. The entire cluster is also healthy; the thing is that everything worked as usual, but every couple of weeks (or when I log into my Mongo cloud from another location) I see this coming back.
The --partition 0 parameter didn't help, and neither did changing retention.ms to 1.
I checked both connectors' status and got RUNNING:
curl localhost:8083/connectors | jq
curl localhost:8083/connectors/monit_people/status | jq
Running docker-compose logs connect I found:
WARN Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286
If the resume token is no longer available then there is the potential for data loss.
Saved resume tokens are managed by Kafka and stored with the offset data.
When running Connect in standalone mode offsets are configured using the:
`offset.storage.file.filename` configuration.
When running Connect in distributed mode the offsets are stored in a topic.
Use the `kafka-consumer-groups.sh` tool with the `--reset-offsets` flag to reset offsets.
Resetting the offset will allow for the connector to be resume from the latest resume token.
Using `copy.existing=true` ensures that all data will be outputted by the connector but it will duplicate existing data.
Future releases will support a configurable `errors.tolerance` level for the source connector and make use of the `postBatchResumeToken`.
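For reference, the offset reset mentioned in that message is usually done along these lines; the group and topic names are placeholders, so first check which consumer group (if any) your setup actually uses:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group <your-group> --topic <your-topic> \
  --reset-offsets --to-latest --execute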
The issue requires more practice with Confluent Platform, so for now I rebuilt the entire environment by removing the containers with:
docker system prune -a -f --volumes
docker container stop $(docker container ls -a -q -f "label=io.confluent.docker")
After running docker-compose up -d all is up and working.

kafka consumer in shell script

I am a newbie with Kafka. I want to consume remote Kafka messages in a shell script. Basically, I have a Linux machine where I cannot run any web server (for some strange reasons); the only thing I can do is use a crontab/shell script to listen for messages from the remotely hosted Kafka. Is it possible to write a simple shell script which will consume a Kafka message, parse it, and take a corresponding action?
Kafka clients are available in multiple languages. You can use any client; you don't need a web server or browser for it.
You may use a shell script for consuming and parsing messages, but that script has to use one of the Kafka clients provided here, because currently there is no client written in pure shell script.
Kafka also ships with console producer and consumer clients; you can use those as well.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Follow the documentation for the details.
You could also use the kafkacat tool. It is a very powerful and fast tool to read data out of Kafka from the console, and it is open source: https://github.com/edenhill/kafkacat.
Many examples are provided on GitHub; one is shown below:
kafkacat -C -b mybroker -t mytopic

does kafka-console-consumer consume from multiple partitions

I have a kafka topic with multiple partitions.
I want to dump the topic into a file to conduct some analysis there; therefore I think the easiest way to do it is to use the kafka-console-consumer.
My question is: is the kafka-console-consumer able to read from all the topic partitions, or will it be assigned to a single partition?
If the latter, how do I assign the kafka-console-consumer to a specific partition? Otherwise I would have to start as many kafka-console-consumers as there are partitions.
Yes, kafka-console-consumer reads from all available partitions. Also take a look at kafkacat; it has some more advanced but useful features.
As previously mentioned by Lukas, you can use kafkacat to dump all messages from a topic to a local file:
kafkacat -b a_broker -t your_topic -o beginning > all_msgs.bin
This will consume messages from all partitions; if you want to limit consumption to a single partition, use the -p <partition> switch.
The default output message separator is a newline (\n); if you want something else, use the -f "<fmt>" option.
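For example, a sketch combining both switches to dump only partition 0 with key and value on each line; the broker and topic names are placeholders as above:
kafkacat -C -b a_broker -t your_topic -p 0 -o beginning -e -f '%k\t%s\n' > partition0.txt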

How to copy a topic from a kafka cluster to another kafka cluster?

One way to do that, as the Kafka documentation shows, is through kafka.tools.MirrorMaker, which can do the trick. However, I need to copy a topic (say test with 1 partition), both its content and metadata, from a production environment to a development environment where there is no connectivity. I could do a simple file transfer between the environments, though.
My question: if I move the *.log and *.index files from the folder test-0 to the destination Kafka cluster, is that good enough? Or is there more that I need to move too, like metadata and ZooKeeper-related data?
Just copying the logs and indexes will not suffice: Kafka stores offsets and topic metadata in ZooKeeper. MirrorMaker is actually quite a simple tool; it spawns consumers for the source topic as well as producers for the target topic and runs until all consumers have consumed the source topic. You won't find a simpler process to migrate a topic.
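For completeness, a classic MirrorMaker run looks roughly like this; a sketch assuming consumer.properties points at the source cluster and producer.properties at the target, and only usable when both clusters are reachable (which is not the case in your scenario):
# consumer.properties: connection settings for the source cluster
# producer.properties: connection settings for the target cluster
bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist "test"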
Use kafkacat
Unless your data is binary, you can use a stock kafkacat.
Write topic to file:
kafkacat -b broker:9092 -e -K, -t my-topic > my-topic.txt
Write file back to topic:
kafkacat -P -b broker:9092 -K, -t my-topic -l my-topic.txt
If your data is binary, you unfortunately have to build your own kafkacat from this branch, which is an as of yet unmerged PR.
Write topic with binary values to file:
kafkacat -b broker:9092 -e -Svalue=base64 -K, -t my-topic > my-topic.txt
Write file back to topic:
kafkacat -P -b broker:9092 -Svalue=base64 -K, -t my-topic -l my-topic.txt
What worked for me in your scenario was the following sequence of actions (a rough shell sketch follows after the list):
Create the topic in Kafka where you will later insert your files (with 1 partition and 1 replica and an appropriate retention.ms config so that Kafka doesn't delete your presumably outdated segments).
Stop your Kafka and Zookeeper.
Find the location of the files of partition 0 of the topic you created in step 1 (it will be something like kafka-logs-<hash>/<your-topic>-0).
In this folder, remove the existing files and copy your files to it.
Start Kafka and Zookeeper.
This also works if your Kafka is run from docker-compose (but you'll have to set up an appropriate volume, of course).
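A rough shell sketch of those steps, assuming a docker-compose setup with services named kafka and zookeeper and the broker's log.dirs mounted at /var/lib/kafka/data; every path and name here is a placeholder for your environment:
# 1. create the empty target topic with 1 partition, 1 replica and unlimited retention
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic test \
  --partitions 1 --replication-factor 1 --config retention.ms=-1
# 2. stop the broker and ZooKeeper
docker-compose stop kafka zookeeper
# 3./4. replace the segment files of partition 0 with the copied ones
rm /var/lib/kafka/data/test-0/*
cp /transfer/test-0/*.log /transfer/test-0/*.index /var/lib/kafka/data/test-0/
# 5. start everything again
docker-compose start zookeeper kafka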