kafka-producer-perf-test.sh throws a CORRUPT_MESSAGE error and does not produce any results. What's the reason?
I am trying to run a performance test on my Kafka cluster using the shell scripts provided with Kafka, but it is generating a CORRUPT_MESSAGE error:
[2019-01-03 10:33:09,119] WARN [Producer clientId=producer-1] Got error produce response with correlation id 2396 on topic-partition my_topic-1, retrying (2147483519 attempts left). Error: CORRUPT_MESSAGE (org.apache.kafka.clients.producer.internals.Sender)
I am running the following command for the test:
bin/kafka-producer-perf-test.sh --topic my_topic --num-records 50 --throughput 10 --producer-props bootstrap.servers=kafka1:9092 key.serializer=org.apache.kafka.common.serialization.StringSerializer value.serializer=org.apache.kafka.common.serialization.StringSerializer --record-size 1
What could be the reason behind this?
Edit 1:
We have a cluster of 5 brokers on r5.xlarge machines (4 cores, 32 GB RAM) with heap option -Xmx3G.
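For reference, a heap option like this is normally applied through the KAFKA_HEAP_OPTS environment variable, which Kafka's startup scripts read; a minimal sketch of applying our -Xmx3G value (the exact invocation on our brokers may differ slightly):
export KAFKA_HEAP_OPTS="-Xmx3G"
bin/kafka-server-start.sh config/server.properties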
Related
I am trying to perform a benchmark test in our Kafka env. I have played with a few configurations such as request.timeout.ms, max.block.ms, and throughput, but have not been able to avoid the errors:
org.apache.kafka.common.errors.TimeoutException: The request timed out.
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
org.apache.kafka.common.errors.TimeoutException: Expiring 148 record(s) for benchmark-6-3r-2isr-none-0: 182806 ms has passed since last append
Producer perf test command:
nohup sh ~/kafka/kafka_2.11-1.0.0/bin/kafka-producer-perf-test.sh --topic benchmark-6p-3r-2isr-none --num-records 10000000 --record-size 100 --throughput 1000 --print-metrics --producer-props acks=all bootstrap.servers=node1:9092,node2:9092,node3:9092 request.timeout.ms=180000 max.block.ms=180000 buffer.memory=100000000 > ~/kafka/load_test/results/6p-3r-10M-100B-t-1-ackall-rto3m-block2m-bm100m-2 2>&1
Cluster: 3 nodes, topic: 6 partitions, RF=3 and minISR=2
I am monitoring the Kafka metrics using a TSDB and Grafana. I know that disk I/O performance is bad [disk await is ~1.5 s, and the I/O queue size and disk utilization metrics are high (60-75%)], but I don't see anything in the Kafka logs that relates the slow disk I/O to the above perf errors.
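To spot-check those same disk figures directly on a broker, iostat from the sysstat package reports per-device await, queue size, and utilization (a generic diagnostic command, not part of my setup; column names vary slightly between sysstat versions):
iostat -x 1
# await = average ms per I/O, avgqu-sz = request queue depth, %util = device utilization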
But I get the errors even at 1000 messages/sec.
I need suggestions to understand the issue and fix the above errors.
I have another very disturbing observation.
The errors go away if I start 2 kafka-producer-perf-test.sh instances with the same configs on different hosts.
If I cancel 1 of the instances, then after some time the above errors start reappearing.
I am new to Kafka and seem to be having several issues with the 'Quickstart' guide for Apache Kafka found here:
https://kafka.apache.org/quickstart#quickstart_kafkaconnect
Ultimately I am trying to learn how to load a Kafka queue with many messages, and so Step 7 of this Quickstart guide seemed relevant.
I installed the binary download (Scala 2.11 - kafka_2.11-1.1.0.tgz ) found here:
https://kafka.apache.org/downloads
I had initially tried to jump straight to step 7, but after finding this question (Kafka Connect implementation errors) I realised I had to do the few steps prior to that.
Therefore I followed the first step successfully:
tar -xzf kafka_2.11-1.1.0.tgz
cd kafka_2.11-1.1.0
Then I followed step 2:
bin/zookeeper-server-start.sh config/zookeeper.properties
But I get the error
ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:90)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:117)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:87)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:53)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
But when I run the next command in that same step:
bin/kafka-server-start.sh config/server.properties
The Kafka server seems to run successfully?
So then I tried to continue to step 3 to create a topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
But this produces the error:
Error while executing topic command : Replication factor: 1 larger than available brokers: 0.
[2018-04-09 14:13:26,908] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 1 larger than available brokers: 0.
(kafka.admin.TopicCommand$)
Then trying step 4:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This seems to work and I can write a message, but then I get a connection error (which is probably due to the fact that the previous steps haven't worked successfully):
kafka_2.11-1.1.0 user$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>This is a message
[2018-04-09 14:17:52,631] WARN [Producer clientId=console-producer] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-04-09 14:17:52,687] WARN [Producer clientId=console-producer] Connection to node -1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
Does anyone know why these issues are occurring and how I can fix them? I can't find any more information in that tutorial about these problems.
As the error suggests, you have something running on the default port for ZK. Either close it or change the zookeeper properties file to use another port.
Address localhost:2181 is already in use. Since ZooKeeper cannot start, the Kafka brokers won't start either. replication-factor must be less than or equal to the number of available brokers, and since no broker is available, the following error is reported (even if you are using --replication-factor 1):
Error while executing topic command : Replication factor: 1 larger than available brokers: 0.
[2018-04-09 14:13:26,908] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 1 larger than available brokers: 0.
(kafka.admin.TopicCommand$)
You either need to stop the process which is running on port 2181, or change the ZooKeeper default port to one which is not currently in use.
To see what is running (PID) on port 2181, run
lsof -i -n -P | grep 2181
If you want to kill that process, then run
kill -9 PID
where PID is the process ID you can get from the lsof command.
Otherwise, you need to change the port in the zookeeper.properties file by modifying the clientPort=2181 parameter. Finally, you need to change the zookeeper.connect=localhost:2181 parameter in the server.properties file accordingly.
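For example, assuming port 2185 happens to be free on your machine (any unused port works), the two edits would be:
In config/zookeeper.properties:
clientPort=2185
In config/server.properties:
zookeeper.connect=localhost:2185
Then restart ZooKeeper first and the broker after it, so both sides agree on the new port.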
While I was doing a proof of concept with Kafka and Flink, I discovered the following: it seems that Kafka producer errors can happen due to workload done on the Flink side?!
Here are more details:
I have sample files like sample??.EDR made of ~700,000 rows with values like "entity", "value", "timestamp".
I use the following command to create the kafka topic:
~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --replication-factor 1 --topic gprs
I use the following command to load sample files on topic:
[13:00] kafka@ubu19: ~/fms
% /home/kafka/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic gprs < ~/sample/sample01.EDR
On the Flink side I have jobs that aggregate the value for each entity with sliding windows of 6 hours and 72 hours (aggregationeachsix, aggregationeachsentytwo).
I did three scenarios:
Load files in the topic without any job running
Load files in the topic with aggregationeachsix job running
Load files in the topic with aggregationeachsix and aggregationeachsentytwo jobs running
The result is that the first two scenarios work, but for the third scenario I get the following errors on the Kafka producer side while loading the files (not always at the same file; it can be the first, second, third or even a later file):
[plenty of lines before this part]
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 35 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1627 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1626 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1625 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1624 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,515] ERROR Error when sending message to topic gprs with key: null, value: 35 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 8 record(s) for gprs-0: 27850 ms has passed since batch creation plus linger time
[2017-08-09 12:56:53,515] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 8 record(s) for gprs-0: 27850 ms has passed since batch creation plus linger time
[plenty of lines after this part]
My question is: why could Flink have an impact on the Kafka producer, and what do I need to change to avoid this error?
It looks like you are saturating your network when both Flink and the Kafka producer are using it, and thus you get TimeoutExceptions.
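A quick way to verify the saturation theory, assuming the sysstat package is installed on the hosts (a diagnostic suggestion only), is to watch per-interface throughput while the third scenario runs:
sar -n DEV 1
# watch rxkB/s and txkB/s on the interface Kafka and Flink share; values
# pinned near the link's capacity while the errors appear confirm saturation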
I'm getting the below error when running the producer client, which takes messages from an input file, kafka_message.log. This log file is filled with 100000 records per second, each message of length 4096.
Error:
[2017-01-09 14:45:24,813] ERROR Error when sending message to topic test2R2P2 with key: null, value: 4096 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for test2R2P2-0
[2017-01-09 14:45:24,816] ERROR Error when sending message to topic test2R2P2 with key: null, value: 4096 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for test2R2P2-0
[2017-01-09 14:45:24,816] ERROR Error when sending message to topic test2R2P2 with key: null, value: 4096 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for test2R2P2-0
[2017-01-09 14:45:24,816] ERROR Error when sending message to topic test2R2P2 with key: null, value: 4096 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for test2R2P2-0
[2017-01-09 14:45:24,816] ERROR Error when sending message to topic test2R2P2 with key: null, value: 4096 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 3 record(s) expired due to timeout while requesting metadata from brokers for test2R2P2-0
Command I run:
$ bin/kafka-console-producer.sh --broker-list x.x.x.x:xxxx,x.x.x.x:xxxx --batch-size 1000 --message-send-max-retries 10 --request-required-acks 1 --topic test2R2P2 <~/kafka_message.log
There are 2 brokers running, and the topic has partitions = 2 and replication factor = 2.
Can someone please help me understand what this error means? I also see message loss, meaning not all the messages from the input file are put into the topic.
On a separate note: I see data loss when running kafka-producer-perf-test.sh and killing one of the brokers (in a 3-node cluster) while the test is running. Is this expected behavior? I see the same results for multiple tests.
Commands I run:
Describe topic:
$ bin/kafka-topics.sh --zookeeper x.x.x.x:2181/kafka-framework --describe |grep test4
Topic:test4R2P2 PartitionCount:2 ReplicationFactor:2 Configs:
Topic: test4R2P2 Partition: 0 Leader: 0 Replicas: 1,0 Isr: 0,1
Topic: test4R2P2 Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Run perf test:
$ bin/kafka-producer-perf-test.sh --num-records 100000 --record-size 4096 --throughput 1000 --topic test4R2P2 --producer-props bootstrap.servers=x.x.x.x:xxxx,x.x.x.x:xxxx
Consumer command:
$ bin/kafka-console-consumer.sh --zookeeper x.x.x.x:2181/kafka-framework --topic test4R2P2 1>~/kafka_message.log
Checking message count:
$ wc -l ~/kafka_message.log
399418 /home/montana/kafka_message.log
I see only 399418 messages in the topic test4R2P2, whereas I have put in a total of 400000 messages by running the perf test 4 times.
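As a cross-check that doesn't rely on the console consumer (which could itself lose output before the wc -l count), the topic's log-end offsets can be summed with Kafka's GetOffsetShell tool; the broker addresses below are placeholders exactly as in the commands above:
$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list x.x.x.x:xxxx,x.x.x.x:xxxx --topic test4R2P2 --time -1
--time -1 requests the latest offset of each partition; adding the per-partition offsets gives the total number of records stored in the topic.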
Exception thrown by the perf command:
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
Exceptions thrown by the consumer command:
[2017-01-10 07:40:07,246] WARN [ConsumerFetcherThread-console-consumer-46599_node-44a8422fe1a0-1484033822261-f07d33d7-0-1], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@695be565 (kafka.consumer.ConsumerFetcherThread)
[2017-01-10 07:40:07,472] WARN Fetching topic metadata with correlation id 1 for topics [Set(test4R2P2)] from broker [BrokerEndPoint(1,10.105.26.1,31052)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
[2017-01-10 07:42:23,073] WARN [ConsumerFetcherThread-console-consumer-46599_node-44a8422fe1a0-1484033822261-f07d33d7-0-0], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@7bd94073 (kafka.consumer.ConsumerFetcherThread)
[2017-01-10 07:44:58,195] WARN [ConsumerFetcherThread-console-consumer-46599_node-44a8422fe1a0-1484033822261-f07d33d7-0-1], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@2855ee73 (kafka.consumer.ConsumerFetcherThread)
[2017-01-10 07:44:58,404] WARN Fetching topic metadata with correlation id 3 for topics [Set(test4R2P2)] from broker [BrokerEndPoint(1,10.105.26.1,31052)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
[2017-01-10 07:45:47,127] WARN [ConsumerFetcherThread-console-consumer-46599_node-44a8422fe1a0-1484033822261-f07d33d7-0-0], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@f8887da (kafka.consumer.ConsumerFetcherThread)
[2017-01-10 07:50:56,291] ERROR [ConsumerFetcherThread-console-consumer-46599_node-44a8422fe1a0-1484033822261-f07d33d7-0-1], Error for partition [test4R2P2,1] to broker 1:kafka.common.NotLeaderForPartitionException (kafka.consumer.ConsumerFetcherThread)
Based on the comments, this suggestion from @amethystic seems to do the trick:
...you could increase the value for "request.timeout.ms" ...
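Applied to the console producer command above, the property can be passed directly on the command line; the 60000 ms value below is only an example, and depending on the Kafka version the flag is either --request-timeout-ms or a generic --producer-property:
$ bin/kafka-console-producer.sh --broker-list x.x.x.x:xxxx,x.x.x.x:xxxx --request-timeout-ms 60000 --batch-size 1000 --message-send-max-retries 10 --request-required-acks 1 --topic test2R2P2 <~/kafka_message.log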
When I try to use the Kafka producer and consumer (0.9.0) scripts to push/pull messages from a topic, I get the errors below.
Producer Error
[2016-01-13 02:49:40,078] ERROR Error when sending message to topic test with key: null, value: 11 bytes with error: Failed to update metadata after 60000 ms. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
Consumer Error
[2016-01-13 02:47:18,620] WARN [console-consumer-90116_f89a0b380f19-1452653212738-9f857257-leader-finder-thread], Failed to find leader for Set([test,0]) (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
kafka.common.KafkaException: fetching topic metadata for topics [Set(test)] from broker [ArrayBuffer(BrokerEndPoint(0,192.168.99.100,9092))] failed
    at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:73)
    at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:94)
    at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
Caused by: java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
    at kafka.network.BlockingChannel.readCompletely(BlockingChannel.scala:129)
    at kafka.network.BlockingChannel.receive(BlockingChannel.scala:120)
    at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:77)
    at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
    at kafka.producer.SyncProducer.send(SyncProducer.scala:119)
    at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
    ... 3 more
Why am I getting the error, and how do I resolve it?
Configuration
Running all components in Docker containers on Mac. ZooKeeper and Kafka running in separate Docker containers.
Docker Machine (boot2docker) IP Address: 192.168.99.100
ZooKeeper Port: 2181
Kafka Port: 9092
Kafka configuration file server.properties sets the following:
host.name=localhost
broker.id=0
port=9092
advertised.host.name=192.168.99.100
advertised.port=9092
Commands
I run the following commands from within the kafka server Docker container. I've already created a topic with one partition and a replication factor of 1.
Notice the leader designation is 0 which might be part of the problem.
root@f89a0b380f19:/opt/kafka/dist# ./bin/kafka-topics.sh --zookeeper 192.168.99.100:2181 --topic test --describe
Topic:test PartitionCount:1 ReplicationFactor:1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
I then do the following to send some messages:
root@f89a0b380f19:/opt/kafka/dist# ./bin/kafka-console-producer.sh --broker-list 192.168.99.100:9092 --topic test
one message
two message
three message
four message
[2016-01-13 02:49:40,078] ERROR Error when sending message to topic test with key: null, value: 11 bytes with error: Failed to update metadata after 60000 ms. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
[2016-01-13 02:50:40,080] ERROR Error when sending message to topic test with key: null, value: 11 bytes with error: Failed to update metadata after 60000 ms. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
[2016-01-13 02:51:40,081] ERROR Error when sending message to topic test with key: null, value: 13 bytes with error: Failed to update metadata after 60000 ms. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
[2016-01-13 02:52:40,083] ERROR Error when sending message to topic test with key: null, value: 12 bytes with error: Failed to update metadata after 60000 ms. (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
This is the command I'm using to attempt to consume messages which yields the consumer error I posted above.
root@f89a0b380f19:/opt/kafka/dist# ./bin/kafka-console-consumer.sh --zookeeper 192.168.99.100:2181 --topic test --from-beginning
I've confirmed ports 2181 and 9092 are open and accessible from within the Kafka Docker container:
root@f89a0b380f19:/# nc -z 192.168.99.100 2181; echo $?;
0
root@f89a0b380f19:/# nc -z 192.168.99.100 9092; echo $?;
0
The solution wasn't what I expected at all. The error message did not line up with what was really happening.
The primary problem was mounting the log directory in Docker to my local file system. My docker run command used a volume mount to map the Kafka log.dir folder in the container to a local directory on the host VM, which was in turn mounted from my Mac. It's that latter point that was the problem.
For instance,
docker run --name kafka -v /Users/<me>/kafka/logs:/var/opt/kafka:rw -p 9092:9092 -d kafka
Since I'm on a Mac and use docker-machine (e.g. boot2docker), I have to mount through my /Users/ path, which boot2docker auto-mounts into the host VM. Because the underlying VM itself uses a bind mount for that path, Kafka's I/O engine wasn't able to communicate with it correctly. If the volume mount pointed to a directory directly on the host Linux VM (i.e. the boot2docker machine), it would work.
I can't explain the exact details since I don't know the ins and outs of Kafka I/O, but when I removed the volume mounted to my Mac file system, it worked.
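For illustration, a run that keeps the Kafka logs on VM-local storage instead of the Mac-mounted /Users path might look like this; /mnt/sda1 is the usual boot2docker-local disk, but verify what your VM actually provides, and the image name kafka is the same placeholder as in the command above:
docker run --name kafka -v /mnt/sda1/kafka-logs:/var/opt/kafka:rw -p 9092:9092 -d kafka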