Lag in reading messages from Kafka topic by Storm spout - apache-kafka

While messages are being ingested into the Kafka topic, the Storm spout does not pick them up immediately; there is a lag of more than 1 hour.
There is one spout and three bolts in the topology.
Spout: ddl
Bolts: kafkabolt, deletebolt, deletemapperbolt
Storm Config:
ddl.spout.executors: 3
topology.spout.executors: 10
topology.acker.executors: 3
topology.bolt.executors.kafkabolt: 2
topology.bolt.executors.deletebolt: 3
topology.bolt.tasks.deletebolt: 3
topology.max.spout.pending: 1
topology.bolt.executors.deletemapperbolt: 3
topology.bolt.tasks.deletemapperbolt: 3
topology.message.timeout.secs: 300
topology.max.task.parallelism: 100
topology.workers: 1
topology.debug: false
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.receiver.buffer.size: 64
topology.transfer.buffer.size: 64
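For reference, a minimal sketch of how the topology-level settings above map onto Storm's Config API (class and method structure here are illustrative; the values are the ones listed in the question). Note that topology.max.spout.pending: 1 allows each spout task only one un-acked tuple in flight at a time, which throttles how fast the spout can emit.

import org.apache.storm.Config;

public class DdlTopologyConfig {
    // Builds a Storm Config mirroring the settings listed above.
    public static Config build() {
        Config conf = new Config();
        conf.setNumWorkers(1);            // topology.workers
        conf.setNumAckers(3);             // topology.acker.executors
        conf.setMaxSpoutPending(1);       // topology.max.spout.pending (one un-acked tuple per spout task)
        conf.setMessageTimeoutSecs(300);  // topology.message.timeout.secs
        conf.setMaxTaskParallelism(100);  // topology.max.task.parallelism
        conf.setDebug(false);             // topology.debug
        return conf;
    }
}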

Related

AWS MSK, Kafka producer throughput relation with number of partitions

Partitions are the unit of parallelism in Kafka, but increasing the number of partitions can reduce producer throughput, because replication traffic consumes cluster bandwidth.
However, in experiments the following was observed:
With 3 brokers: with 2 partitions per broker, performance drops compared to 1 partition per broker.
With 9 brokers: with 3 partitions per broker, performance improves compared to 1 partition per broker.
Reasoning from the 3-broker case, performance should also have degraded in the 9-broker case, but instead it increased.
What can be the reason for this behaviour?
Experiment details:
kafka-producer-perf-test was used for benchmarking.
Parameters passed to the tool: --num-records 12000000 --throughput -1 acks=1 linger.ms=100 buffer.memory=5242880 compression.type=none request.timeout.ms=30000 --record-size 1000
Results of the test are in the attached image.
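For context, the producer properties passed to the perf-test tool above correspond roughly to the following client configuration (a sketch only; the bootstrap servers and serializers are placeholders that were not given in the question):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class PerfTestProducerConfig {
    // Creates a producer configured like the perf-test run described above.
    public static KafkaProducer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "100");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "5242880");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}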

Why does the Kafka MirrorMaker target topic contain half of the original messages?

I want to copy all messages from a topic in a Kafka cluster, so I ran Kafka MirrorMaker. However, it seems to have copied only roughly half of the messages from the source cluster (I checked that there is no consumer lag on the source topic). I have 2 brokers in the source cluster; does this have anything to do with it?
This is the source cluster config:
log.retention.ms=1814400000
transaction.state.log.replication.factor=2
offsets.topic.replication.factor=2
auto.create.topics.enable=true
default.replication.factor=2
min.insync.replicas=1
num.io.threads=8
num.network.threads=5
num.partitions=1
num.replica.fetchers=2
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000
The source topic has 4 partitions and is not compacted. The Mirrormaker config is:
mirrormaker-consumer.properties
bootstrap.servers=broker1:9092,broker2:9092
group.id=picturesGroup3
auto.offset.reset=earliest
mirrormaker-producer.properties
bootstrap.servers=localhost:9092
max.in.flight.requests.per.connection=1
retries=2000000000
acks=all
max.block.ms=2000000000
Below are the stats from Kafdrop on the source cluster topic:
Partition | First Offset | Last Offset | Size | Leader Node | Replica Nodes | In-sync Replica Nodes | Offline Replica Nodes | Preferred Leader | Under-replicated
0         | 13659        | 17768       | 4109 | 1           | 1             | 1                     | -                     | Yes              | No
1         | 13518        | 17713       | 4195 | 2           | 2             | 2                     | -                     | Yes              | No
2         | 13664        | 17913       | 4249 | 1           | 1             | 1                     | -                     | Yes              | No
3         | 13911        | 18072       | 4161 | 2           | 2             | 2                     | -                     | Yes              | No
and these are the stats for the target topic after the MirrorMaker run:
Partition | First Offset | Last Offset | Size | Leader Node | Replica Nodes | In-sync Replica Nodes | Offline Replica Nodes | Preferred Leader | Under-replicated
0         | 2132         | 4121        | 1989 | 1           | 1             | 1                     | -                     | Yes              | No
1         | 2307         | 4217        | 1910 | 1           | 1             | 1                     | -                     | Yes              | No
2         | 2379         | 4294        | 1915 | 1           | 1             | 1                     | -                     | Yes              | No
3         | 2218         | 4083        | 1865 | 1           | 1             | 1                     | -                     | Yes              | No
As you can see, based on the Size column, only roughly half of the source messages made it into the target topic. What am I doing wrong?
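One way to cross-check these per-partition counts without relying on Kafdrop's Size column is to ask the brokers directly; below is a minimal sketch using the Kafka AdminClient listOffsets API (the bootstrap servers and topic name are placeholders, and a reasonably recent kafka-clients library is assumed):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class TopicMessageCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
        String topic = "mytopic"; // placeholder topic name
        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, OffsetSpec> earliestReq = new HashMap<>();
            Map<TopicPartition, OffsetSpec> latestReq = new HashMap<>();
            for (int p = 0; p < 4; p++) {                    // the topic above has 4 partitions
                TopicPartition tp = new TopicPartition(topic, p);
                earliestReq.put(tp, OffsetSpec.earliest());
                latestReq.put(tp, OffsetSpec.latest());
            }
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> earliest =
                    admin.listOffsets(earliestReq).all().get();
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestReq).all().get();

            // Message count per partition is (latest offset - earliest offset), like Kafdrop's Size column.
            long total = 0;
            for (TopicPartition tp : latest.keySet()) {
                long count = latest.get(tp).offset() - earliest.get(tp).offset();
                total += count;
                System.out.println(tp + " -> " + count + " messages");
            }
            System.out.println("total -> " + total);
        }
    }
}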
I realized that the issue happened because I was copying data from a cluster with 2 brokers into a cluster with 1 broker, so I assume MirrorMaker 1 only copied data from one broker of the original cluster. When I configured the target cluster with 2 brokers, all of the messages were copied.
Regarding the advice from OneCricketeer to use MirrorMaker 2: this also worked, although it took me a while to arrive at the correct configuration file:
clusters = source, dest
source.bootstrap.servers = sourcebroker1:9092,sourcebroker2:9092
dest.bootstrap.servers = destbroker1:9091,destbroker2:9092
topics = .*
groups = mm2topic
source->dest.enabled = true
offsets.topic.replication.factor=1
offset.storage.replication.factor=1
auto.offset.reset=latest
In addition, MirrorMaker 2 can be found in the connect container of this KafkaConnect project (enter the container; the connect-mirror-maker.sh executable is in the /kafka/bin directory).
A major downside of the MirrorMaker 2 solution is that it adds a prefix to the topic names in the target cluster (in my case the new names would require changing application code). The prefix cannot be changed in the MirrorMaker 2 configuration, so the only way around it is to implement a custom Java class, as explained here.
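For illustration, such a class would typically build on Kafka's ReplicationPolicy interface; below is a minimal sketch that extends DefaultReplicationPolicy and keeps the topic name unchanged (the class name is hypothetical, and newer Kafka releases ship an IdentityReplicationPolicy that serves the same purpose):

import org.apache.kafka.connect.mirror.DefaultReplicationPolicy;

// Hypothetical example class; reference it from the MM2 properties via replication.policy.class.
public class NoPrefixReplicationPolicy extends DefaultReplicationPolicy {

    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        // Keep the original topic name instead of "<sourceClusterAlias>.<topic>".
        return topic;
    }

    @Override
    public String topicSource(String topic) {
        // Without a prefix, the source cluster can no longer be inferred from the topic name.
        return null;
    }

    @Override
    public String upstreamTopic(String topic) {
        return topic;
    }
}

The jar containing the class goes on MirrorMaker 2's classpath and is referenced with replication.policy.class in the MM2 properties; keep in mind that dropping the prefix also removes MM2's naming-based loop detection between clusters.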

Why does the size of the last Kafka log segment decrease a while after the producer stops?

I am benchmarking a Kafka producer, and I found that the size of the last log segment decreases a while after the producer stops.
See the example of 00000000000005746692.log below:
2020-05-23 08:40:35 /bin/du /data/kafka/json_test-0 -a
...
4 /data/kafka/json_test-0/00000000000004793445.snapshot
1048704 /data/kafka/json_test-0/00000000000004793445.log
68 /data/kafka/json_test-0/00000000000004793445.index
104 /data/kafka/json_test-0/00000000000004793445.timeindex
4 /data/kafka/json_test-0/00000000000005746692.snapshot
258176 /data/kafka/json_test-0/00000000000005746692.log
10240 /data/kafka/json_test-0/00000000000005746692.index
10240 /data/kafka/json_test-0/00000000000005746692.timeindex
6571068 /data/kafka/json_test-0
2020-05-23 08:40:38 /bin/du /data/kafka/json_test-0 -a
...
4 /data/kafka/json_test-0/00000000000004793445.snapshot
1048464 /data/kafka/json_test-0/00000000000004793445.log
68 /data/kafka/json_test-0/00000000000004793445.index
104 /data/kafka/json_test-0/00000000000004793445.timeindex
4 /data/kafka/json_test-0/00000000000005746692.snapshot
222224 /data/kafka/json_test-0/00000000000005746692.log
10240 /data/kafka/json_test-0/00000000000005746692.index
10240 /data/kafka/json_test-0/00000000000005746692.timeindex
6534876 /data/kafka/json_test-0
The size of /data/kafka/json_test-0/00000000000005746692.log decreased from 258176 to 222224.
Why does the size of the last log segment decrease a while after the producer stops?
Edit:
Kafka version: kafka_2.12-2.0.1
Producer's compression.type: snappy
I suspected the log.preallocate configuration (KIP-20), but config/server.properties does not define this property, and it defaults to false (log.preallocate).
Data will be compressed by the producer, written in compressed format on the server, and decompressed by the consumer. (Compression)
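As a side note, the effective value of log.preallocate on a running broker can be verified without reading server.properties; a minimal sketch with the Kafka AdminClient (the bootstrap server and broker id are placeholders):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CheckPreallocate {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "0" is a placeholder; use the id of the broker being inspected.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Map<ConfigResource, Config> configs =
                    admin.describeConfigs(Collections.singleton(broker)).all().get();
            ConfigEntry entry = configs.get(broker).get("log.preallocate");
            System.out.println("log.preallocate = " + entry.value() + " (source: " + entry.source() + ")");
        }
    }
}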

Explain why metricbeat Kafka partition metric has a higher count than consumer metric

The problem
Hi, I am trying to visualize Kafka lag using Grafana. I have been trying to log Kafka lag with Metricbeat and doing the math myself, since Metricbeat does not support logging Kafka lag in the version that I am using (although it has been implemented recently). Instead of using max(partition.offset.newest) - max(consumergroup.offset) to calculate the lag, I am using sum(partition.offset.newest) - sum(consumergroup.offset), filtered on a particular kafka.topic.name. However, the sums do not tally; upon further investigation, I found that the counts do not even tally! The count for partition offsets is 30 per 10s, while the count for consumergroup offsets is 12 per 10s. I expect the counts to be the same.
I do not understand why Metricbeat logs the partition metricset more often than the consumergroup metricset. At first I thought it was because of my Metricbeat configuration, where I have 2 host groups defined, which might have caused metrics to be logged multiple times. However, after changing my configuration, the count just dropped by half.
TL;DR
Why are the Metricbeat counts for partition and consumergroup different?
Setup
Kafka: 2 brokers
Kafka topic partitions:
Topic: xxx PartitionCount:3 ReplicationFactor:2 Configs:
Topic: xxx Partition: 0 Leader: 2 Replicas: 2,1 Isr: 2,1
Topic: xxx Partition: 1 Leader: 1 Replicas: 1,2 Isr: 1,2
Topic: xxx Partition: 2 Leader: 2 Replicas: 2,1 Isr: 2,1
Metricbeat config (modules.d/kafka.yml):
- module: kafka
  #metricsets:
  #  - partition
  #  - consumergroup
  period: 10s
  hosts: ["xxx.yyy:9092"]
Versions
Kafka 2.11-0.11.0.0
Elasticsearch-7.2.0
Kibana-7.2.0
Metricbeat-7.2.0
After much debugging, I figured out what was wrong:
1. For some reason, my Kafka broker 1 has only the producer metric and no consumer metric; connecting to broker 2 solved this problem. Connecting to both brokers adds both metrics together.
2. Lucene uses fuzzy search, so my data also included some other consumer groups. For exact word matching, use kafka.partition.topic.keyword: 'xxx' instead. This brought the ratio of my Kafka producer offsets to consumer offsets to 2:1.
3. Metricbeat logs the replicas as well, so I need to filter with NOT kafka.partition.partition.is_leader: false to get only the partition leaders. This brought the consumer-to-partition ratio to 1:1.
After these 3 steps, I can use the formula sum(partition.offset.newest) - sum(consumergroup.offset) to get the lag.
However, I still do not know why broker 1 does not have the consumer information.
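As a cross-check independent of Metricbeat, the same lag figure can be computed directly from the brokers; a minimal sketch with the Kafka AdminClient (the bootstrap servers and consumer group id are placeholders, and a reasonably recent kafka-clients version is assumed):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the consumer group ("my-group" is a placeholder group id).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-group")
                    .partitionsToOffsetAndMetadata()
                    .get();

            // Newest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> newest =
                    admin.listOffsets(request).all().get();

            // lag = sum(newest offset) - sum(committed offset), the same formula as above.
            long lag = 0;
            for (TopicPartition tp : committed.keySet()) {
                lag += newest.get(tp).offset() - committed.get(tp).offset();
            }
            System.out.println("total lag = " + lag);
        }
    }
}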

What does ProducerPerformance Tool in Kafka give?

What does running the following Kafka tool actually give?
./bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --throughput=10000 --topic=TOPIC --num-records=50000000 --record-size=200 --producer-props bootstrap.servers=SERVERS buffer.memory=67108864 batch.size=64000
When running with a single producer I get 90 MB/s. When I use 3 separate producers on separate nodes I get only around 60 MB/s per producer. (My Kafka cluster consists of 2 nodes, and the topic has 6 partitions.)
What does 90 MB/s mean? Is it the maximum rate at which a producer can produce?
Does partition count affect this value?
Why does it drop to 60 MB/s when there are 3 producers (still no network saturation on the broker side)?
Thank you
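As a quick sanity check on what the MB/s figure means in terms of records, here is a small sketch converting the reported throughput back into records per second using the --record-size from the command above (it assumes the tool reports MB as 1024*1024 bytes; treat that constant as an assumption):

public class ThroughputToRecords {
    public static void main(String[] args) {
        double recordSizeBytes = 200;   // --record-size from the command above
        double reportedMBps = 90;       // single-producer result reported by the tool
        // Assumes 1 MB = 1024 * 1024 bytes in the tool's report.
        double recordsPerSecond = reportedMBps * 1024 * 1024 / recordSizeBytes;
        System.out.printf("%.0f MB/s at %.0f-byte records ~= %.0f records/s%n",
                reportedMBps, recordSizeBytes, recordsPerSecond);
    }
}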