I have a Kafka topic with multiple partitions.
I want to dump the topic into a file to do some analysis on it, and I think the easiest way to do that is with kafka-console-consumer.
My question is: is kafka-console-consumer able to read from all the topic's partitions, or will it be assigned to a single partition?
If the latter, how do I assign kafka-console-consumer to a specific partition, given that I would then have to start as many kafka-console-consumer instances as there are partitions?
Yes, kafka-console-consumer reads from all available partitions. Also take a look at kafkacat; it has some more advanced but useful features.
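For example, a full dump of the topic to a file could look like this (broker address, topic name and output file are placeholders):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic your_topic --from-beginning > all_msgs.txt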
As previously mentioned by Lukas, you can use kafkacat to dump all messages from a topic to a local file:
kafkacat -b a_broker -t your_topic -o beginning > all_msgs.bin
This will consume messages from all partitions; if you want to limit it to a single partition, use the -p <partition> switch.
The default output message separator is a newline (\n); if you want something else, use the -f "<fmt>" option.
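For instance, to dump only partition 0 with a custom output format (broker and topic names are placeholders):
kafkacat -b a_broker -t your_topic -p 0 -o beginning -f 'key: %k, payload: %s\n' > partition0_msgs.txt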
I'm trying to run some tests to understand MM2 behavior. As part of that I had the following questions:
How to correctly pass a custom consumer group for MM2 in mm2.properties?
Based on this question, I tried passing <alias>.group.id=temp_cons_group in mm2.properties and, on restarting the MM2 instance, could see the consumer group mentioned in the MM2 logs.
However, when I list the consumer groups registered on the source broker, the group doesn't show up.
How to test if the property <alias>.consumer.auto.offset.reset works?
Here I want to consume the same messages again, so, in reference to the question, I tried setting <source_alias>.consumer.auto.offset.reset to earliest and restarted MM2.
I was able to see the property set correctly in MM2 logs but did not get the messages from the beginning in the target cluster topic.
How do I start a MM2 instance to start consuming messages from a specific offset for a topic present in the source cluster?
MirrorMaker does not use a consumer group to run and instead uses the assign() API, so it's expected that you don't see a group.
It's hard to "test". One way to verify that this configuration was picked up is to check that it's present in the logs when MirrorMaker starts its consumers.
This is currently not trivial to do. There's a KIP in progress to improve the process, but at the moment it requires manually updating the internal offsets topic of your Connect instance. At a very high level, here's the process:
First, ensure MirrorMaker is not running. Then you need to find the offset records for MirrorMaker in the offsets topic using a command like:
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic <CONNECT_OFFSET_TOPIC> \
--from-beginning \
--property print.key=true | grep <SOURCE_CONNECTOR_NAME>
You will see records with offsets for each partition MirrorMaker handles. To update the offsets, you need to produce new records to this topic with the offsets you want. For each partition, ensure your record has the same key as the existing message so it replaces the existing stored offsets.
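A rough sketch of that step with kafka-console-producer (the offsets topic name, the connector name and the exact JSON layout of key and value depend on your setup, so copy the key verbatim from the record you found above and only change the offset in the value):
./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic <CONNECT_OFFSET_TOPIC> \
  --property parse.key=true \
  --property "key.separator=|"
Then type one line per partition: the existing key, the separator character, and the value JSON carrying the offset you want MirrorMaker to resume from. Restart MirrorMaker afterwards.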
How can I list all the topics that have data flowing in them?
I have a Kafka cluster (Confluent 7.2.1) with ZooKeeper and over 120 topics, some of which are not active. I can list all the topics using kafka-topics --list --bootstrap-server ..., which returns the full list.
I am looking to list only the topics that currently have data in them. My topics retain data for up to 7 days.
Thank you
Out of the box, there is no easy way to do this.
You would need to loop over every topic, pass each name to the GetOffsetShell command with the --time -2 and --time -1 flags to get the start and end offsets, and compare them; when they match, the topic is currently empty. That still doesn't rule out a new producer spinning up in the few seconds it takes to run the tool.
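A rough sketch of such a loop (tool and flag names vary a bit between Kafka/Confluent versions; older GetOffsetShell builds take --broker-list instead of --bootstrap-server):
for topic in $(kafka-topics --bootstrap-server localhost:9092 --list); do
  # --time -2 returns the earliest offsets, --time -1 the latest; sum them across partitions
  start=$(kafka-run-class kafka.tools.GetOffsetShell --bootstrap-server localhost:9092 --topic "$topic" --time -2 | awk -F: '{s+=$3} END{print s+0}')
  end=$(kafka-run-class kafka.tools.GetOffsetShell --bootstrap-server localhost:9092 --topic "$topic" --time -1 | awk -F: '{s+=$3} END{print s+0}')
  # if the end offsets are ahead of the start offsets, the topic still holds data
  [ "$end" -gt "$start" ] && echo "$topic"
done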
Alternatively, there may be metrics tools, such as Prometheus exporters, that can query the disk usage of the broker log directories and report empty folders.
There are currently 22 replicas configured for a specific topic in Kafka 0.9.0.1.
Is it possible to reduce the replication factor of the topic to 3?
How to do it via Kafka CLI or Kafka Manager?
I have only found a way to increase the number of replicas here.
Yes. Changing (increasing or decreasing) the replication factor can be done using the following 2-step process:
First, you'll need to create a partition assignment structure for the given topic in the form of a json file. Here's an example:
{
"version":1,
"partitions":[
{"topic":"<topic-name>","partition":0,"replicas":[<broker-ids>]},
{"topic":"<topic-name","partition":1,"replicas":[<broker-ids>]},
...
{"topic":"<topic-name","partition":n,"replicas":[<broker-ids>]},
]
}
Save this file with any name, say decrease-replication-factor.json.
Note: <broker-ids> represents the comma-separated list of broker ids you want the replicas to live on. To reduce the replication factor to 3, list exactly 3 broker ids for each partition.
Run the kafka-reassign-partitions script and supply the above json as input, in the following way:
kafka-reassign-partitions --zookeeper <zookeeper-server-list>:2181
--reassignment-json-file decrease-replication-factor.json --execute
Now, if you run the describe command for the given topic, you should see the reduced replicas as per the supplied json.
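For example (Kafka 0.9 era, so the tool still talks to ZooKeeper; server list and topic name are placeholders):
kafka-topics --zookeeper <zookeeper-server-list>:2181 --describe --topic <topic-name>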
There are also some tools created in the Kafka community that can help you achieve this. Here is one such example created by LinkedIn.
For reading all partitions in a topic:
~bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic myTopic --from-beginning
How can I consume a particular partition of the topic? (For instance, with partition key 13.)
And how can I produce a message to a partition with a particular partition key? Is it possible?
You can't with the console consumer and producer, but you can with the higher-level clients (in any language that works for you).
On the consumer side, you can for example use the assign method to manually assign a specific topic partition to consume (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java#L906).
On the producer side, you can use a custom Partitioner to override the partitioning logic and decide manually how to partition your messages (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java#L206-L208).
With the many clients that are available, you can specify the partition number just as serejja has stated.
Also look into https://github.com/cakesolutions/scala-kafka-client which uses actors and provides multiple modes for manual partitions and offsets.
If you want to do the same on the terminal, I suggest using kafkacat. (https://github.com/edenhill/kafkacat)
My personal choice during development.
You can do things like
kafkacat -b localhost:9092 -f 'Topic %t[%p], offset::: %o, data: %s key: %k\n' -t testtopic
And for a specific partition, you just need to use -p flag.
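For example, to read only partition 5 of the same topic (the partition number here is arbitrary):
kafkacat -b localhost:9092 -t testtopic -p 5 -o beginning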
Console producer and consumer do not provide this flexibility. You could achieve this through Kafka APIs.
You could manually assign a partition to the consumer using the assign() operation (KafkaConsumer#assign). This disables group rebalancing, so please use it very carefully.
You could specify the partition explicitly in the KafkaProducer record; if it's not specified, the record is stored according to the Partitioner policy.
How can I consume particular partition of the topic? (for instance with partition key 13)
There is a flag called --partition in kafka-console-consumer
--partition <Integer: partition> The partition to consume from.
Consumption starts from the end of
the partition unless '--offset' is
specified.
The command is as follows:
bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic test --partition 0 --from-beginning
One way to do that, as the Kafka documentation shows, is through kafka.tools.MirrorMaker, which can do the trick. However, I need to copy a topic (say test, with 1 partition), both its content and metadata, from a production environment to a development environment where there is no connectivity between the two. I could do a simple file transfer between the environments, though.
My question: if I move the *.log and *.index files from the test-0 folder to the destination Kafka cluster, is that good enough? Or is there more I need to do, like metadata and ZooKeeper-related data that also has to move?
Just copying the logs and indexes will not suffice: Kafka stores offsets and topic metadata in ZooKeeper. MirrorMaker is actually quite a simple tool; it spawns consumers on the source topic as well as producers on the target topic and runs until the consumers have consumed the source queue. You won't find a simpler process to migrate a topic.
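For reference, a minimal MirrorMaker run looks roughly like this (the property file names are placeholders and the exact flags differ between Kafka versions; the consumer config points at the source cluster, the producer config at the target):
bin/kafka-mirror-maker.sh --consumer.config source-consumer.properties \
  --producer.config target-producer.properties --whitelist test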
Use kafkacat
Unless your data is binary, you can use a stock kafkacat.
Write topic to file:
kafkacat -b broker:9092 -e -K, -t my-topic > my-topic.txt
Write file back to topic:
kafkacat -b broker:9092 -K, -t my-topic -l my-topic.txt
If your data is binary, you unfortunately have to build your own kafkacat from this branch, which is an as-yet-unmerged PR.
Write topic with binary values to file:
kafkacat -b broker:9092 -e -Svalue=base64 -K, -t my-topic > my-topic.txt
Write file back to topic:
kafkacat -b broker:9092 -Svalue=base64 -K, -t my-topic -l my-topic.txt
What worked for me in your scenario was the following sequence of actions:
Create the topic in Kafka where you will later insert your files (with 1 partition, 1 replica and an appropriate retention.ms config so that Kafka doesn't delete your presumably outdated segments; see the example command after this list).
Stop your Kafka and Zookeeper.
Find the location of the files of the 0-partition you created in Kafka in step 1 (it will be something like kafka-logs-<hash>/<your-topic>-0).
In this folder, remove the existing files and copy your files to it.
Start Kafka and Zookeeper.
This also works if your Kafka is run from docker-compose (but you'll have to set up an appropriate volume, of course).
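A sketch of the topic creation from step 1 (the topic name is a placeholder; retention.ms=-1 keeps segments forever, so adjust to taste; older Kafka versions use --zookeeper instead of --bootstrap-server):
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic <your-topic> \
  --partitions 1 --replication-factor 1 --config retention.ms=-1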