I have Kafka 0.9.
Today I needed to restart the Kafka server.
But after the restart I checked the topics and saw none of my topics, only the standard one:
/opt/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
After the start, this warning appears in the Kafka server logs:
[2018-03-19 13:10:53,199] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/data/kafka/ae-result-from-0/00000000000000000000 index) has non-zero size but the last offset is 0 which is no larger than the base offset 0.}. deleting /data/kafka/ae-result-from-0/00000000000000000000.timeindex, /data/kafka/ae-result-from-0/00000000000000000000.index and rebuilding index... (kafka.log.Log)
How do I recover the topics correctly?
I am trying to learn Kafka and I got the error below:
[2021-01-21 13:46:43,247] WARN [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions
__consumer_offsets-22,first_topic-2,__consumer_offsets-37,first_topic-0,__consumer_offsets-
38,__consumer_offsets-13,twitter_tweets-5,__consumer_offsets-30,twitter_tweets-3,__consumer_offsets-
8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-
7,__consumer_offsets-9,__consumer_offsets-46,new_topic-0,__consumer_offsets-25,__consumer_offsets-
35,twitter_tweets-0,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-
23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-
32,__consumer_offsets-40 and stopped moving logs for partitions because they are in the failed log
directory C:\kafka_2.13-2.6.0\data\kafka. (kafka.server.ReplicaManager)
[2021-01-21 13:46:43,252] WARN Stopping serving logs in dir C:\kafka_2.13-2.6.0\data\kafka
(kafka.log.LogManager)
[2021-01-21 13:46:43,254] ERROR Shutdown broker because all log dirs in C:\kafka_2.13-2.6.0\data\kafka have failed (kafka.log.LogManager)
This happens every time I run a command, for example:
bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic streams-plaintext-input
When I delete all the offsets in the /data folder, everything runs smoothly. Is this happening because of the 7-day retention period that Kafka has?
The main issue is that Kafka depends on POSIX filesystem semantics that don't work well on Windows.
Kafka uses specific features of POSIX to achieve high performance, so emulations such as WSL 1 are insufficient. For example, the broker will crash when it rolls a segment file.
This appears to be the error you're mentioning about segment retention.
If you want to use Kafka on Windows, WSL 2 is the suggested solution:
https://www.confluent.io/blog/set-up-and-run-kafka-on-windows-linux-wsl-2/
Also note: the --zookeeper flag is deprecated in favor of --bootstrap-server.
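For example, a sketch of the equivalent topic creation on newer versions (assuming the broker listens on localhost:9092, the default for a local setup):
bin\windows\kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic streams-plaintext-input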
I have a topic that is compacted:
/opt/kafka/bin/kafka-topics.sh --zookeeper localhost --describe --topic myTopic
Topic:myTopic PartitionCount:1 ReplicationFactor:1 Configs:cleanup.policy=compact
There are no messages on it:
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic --from-beginning --property print.key=true
^CProcessed a total of 0 messages
Both the earliest and the latest offset on the only partition are 12, though.
/opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic myTopic --time -2
myTopic:0:12
/opt/kafka/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic myTopic --time -1
myTopic:0:12
I wonder what could have happened to these 12 messages? The number is correct; I was expecting them to be there, but for some reason they're gone.
As far as I understand, even if these 12 messages had the same key, I should have seen at least one; that's how compaction works.
The topic in question was created as compacted. The only weird thing that might have happened during that time is that the Kafka instance lost its ZooKeeper data completely. Is it possible that this also caused the data loss?
To rephrase the last question: can something bad happen to the physical data on Kafka if I remove all the Kafka-related znodes on ZooKeeper?
In addition, here are some logs from Kafka startup.
[2019-04-30 12:02:16,510] WARN [Log partition=myTopic-0, dir=/var/lib/kafka] Found a corrupted index file corresponding to log file /var/lib/kafka/myTopic-0/00000000000000000000.log due to Corrupt index found, index file (/var/lib/kafka/myTopic-0/00000000000000000000.index) has non-zero size but the last offset is 0 which is no greater than the base offset 0.}, recovering segment and rebuilding index files... (kafka.log.Log)
[2019-04-30 12:02:16,524] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Completed load of log with 1 segments, log start offset 0 and log end offset 12 in 16 ms (kafka.log.Log)
[2019-04-30 12:35:34,530] INFO Got user-level KeeperException when processing sessionid:0x16a6e1ea2000001 type:setData cxid:0x1406 zxid:0xd11 txntype:-1 reqpath:n/a Error Path:/config/topics/myTopic Error:KeeperErrorCode = NoNode for /config/topics/myTopic (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-04-30 12:35:34,535] INFO Topic creation Map(myTopic-0 -> ArrayBuffer(0)) (kafka.zk.AdminZkClient)
[2019-04-30 12:35:34,547] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions myTopic-0
(kafka.server.ReplicaFetcherManager)
[2019-04-30 12:35:34,580] INFO [Partition myTopic-0 broker=0] No checkpointed highwatermark is found for partition myTopic-0 (kafka.cluster.Partition)
[2019-04-30 12:35:34,580] INFO Replica loaded for partition myTopic-0 with initial high watermark 0 (kafka.cluster.Replica)
[2019-04-30 12:35:34,580] INFO [Partition myTopic-0 broker=0] myTopic-0 starts at Leader Epoch 0 from offset 12. Previous Leader Epoch was: -1 (kafka.cluster.Partition)
And the messages were indeed removed:
[2019-04-30 12:39:24,199] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Found deletable segments with base offsets [0] due to retention time 10800000ms breach (kafka.log.Log)
[2019-04-30 12:39:24,201] INFO [Log partition=myTopic-0, dir=/var/lib/kafka] Rolled new log segment at offset 12 in 2 ms. (kafka.log.Log)
NoNode for /config/topics/myTopic
Kafka no longer knows this topic exists or that it should be compacted, which seems evident from the log cleaner output:
due to retention time 10800000ms breach
So yes, ZooKeeper is very important. But so is gracefully shutting down a broker with kafka-server-stop; otherwise, forcibly killing the process or the host machine can leave you with corrupted partition segments.
I'm not entirely sure what conditions would lead to this:
the last offset is 0 which is no greater than the base offset 0
But assuming that you had a full cluster and that the topic had a replication factor higher than 1, you could hope that at least one replica was healthy.
The way to recover a broker with a corrupted index/partition would be to stop the Kafka process, delete the corrupted partition folder from disk, restart Kafka on that machine, and then let it replicate back from a healthy instance.
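A rough sketch of those steps, assuming the install and log paths seen elsewhere in this thread (/opt/kafka and /var/lib/kafka) and that another broker holds a healthy replica of the partition:
/opt/kafka/bin/kafka-server-stop.sh
rm -rf /var/lib/kafka/myTopic-0
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
After the restart, the broker should fetch myTopic-0 back from the current leader.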
I am using kafka_2.12-1.1.0 on Windows. Below is my retention configuration.
log.retention.ms=120000
log.segment.bytes=2800
log.retention.check.interval.ms=30000
I see the logs get rolled over, generating *0000.log, then *0069.log, then *00103.log, and so on. I expected the old files to be deleted, but when the retention logic kicks in I get the error below ("the process cannot access the file because it is being used by another process") and the Kafka server crashes and doesn't recover from there. This happens when it is trying to delete 00000000000000000000.log. Can you let me know the commands to check whether these are active segments, or why it is trying to delete this file and crashing?
Please find the log excerpt below.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 39
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 39 in 39 ms.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 69
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 69 in 52 ms.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 91
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 91 in 18 ms.
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Found deletable segments with base offsets [0,69] due to retention time 120000ms breach (kafka.log.Log)
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 103 (kafka.log.ProducerStateManager)
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 103 in 25 ms. (kafka.log.Log)
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Scheduling log segment [baseOffset 0, size 2746] for deletion. (kafka.log.Log)
ERROR Error while deleting segments for myLocalTopic-0 in dir D:\tmp\kafka-logs (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log -> D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
............. at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:98)
.......................................................
ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for myLocalTopic-0 in dir D:\tmp\kafka-logs
Caused by: java.nio.file.FileSystemException: D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log -> D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
Apache Kafka provides us with the following retention policies:
Time-based retention
Size-based retention
For time-based retention, try the following commands.
Command to set the retention time:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config retention.ms=1680000
Command to remove the topic-level retention time:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --delete-config retention.ms
For size-based retention, try the following commands.
To set the retention size:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config retention.bytes=104857600
To remove it:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --delete-config retention.bytes
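On newer Kafka versions, topic configuration is usually changed with kafka-configs.sh instead; a rough equivalent (assuming the same ZooKeeper address) would be:
./bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=1680000
./bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name my-topic --delete-config retention.ms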
I had been running my services that work with Kafka for a year already, and no spontaneous leader changes happened.
But for the last two weeks this has started happening quite often.
The Kafka log for that:
[2015-09-27 15:35:14,826] INFO [ReplicaFetcherManager on broker 2]
Removed fetcher for partitions [myTopic] (kafka.server.ReplicaFetcherManager)
[2015-09-27 15:35:14,830] INFO Truncating log myTopic-0 to offset 11520979. (kafka.log.Log)
[2015-09-27 15:35:14,845] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 713276 from client ReplicaFetcherThread-0-2 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:14,857] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 256685 from client mirrormaker-1 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:20,171] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic,0] (kafka.server.ReplicaFetcherManager)
What can cause a leader switch? If there is information on this in the Kafka documentation, please just point me to the link; I've failed to find it.
System configuration
kafka version: kafka_2.10-0.8.2.1
os: Red Hat Enterprise Linux Server release 6.5 (Santiago)
server.properties (differs from default):
broker.id=001
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.bytes=-1
controlled.shutdown.enable=true
auto.create.topics.enable=false
It appears that the lead broker is down for that partition. It might be that the data directory (log.dirs) configured in server.properties is out of space and the broker is not able to accommodate new data.
Also, what is the replication factor of the topic and how many brokers are in the cluster?
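As a quick check for the out-of-space hypothesis, something like the following (a sketch; /data/kafka stands in for whatever log.dirs points to on your broker):
df -h /data/kafka
du -sh /data/kafka/*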
I am assuming you have one topic and one partition with a replication factor of 2, which is not a good configuration for optimal Kafka performance and consumers.
Your logs are not clear enough to explain the leader switch. The major issue may be that your topic has only one leader because it has only one partition, and that single log file is getting bigger day by day. Kafka does some internal rebalancing at some level (details not confirmed), which could be the reason for your leader switch, but I am not sure.
Also, your second log line says some of the logs were truncated. Can you please go through the logs in detail and check whether this happens only after truncation?
As you already mentioned, you checked your Kafka log directory files and their sizes. Please run describe when the issue occurs; the leader switch will be reflected there as well. Alternatively, if you can set up a dashboard that displays the leader over time, it will be easier to find the root cause.
bin/kafka-topics.sh --describe --zookeeper Zookeeperhost:Port --topic TopicName
Suggestion: I would suggest creating a new topic with more partitions (read the Kafka documentation to get a good idea of the optimal number of partitions) and start writing to it. Or you can check how to change the number of partitions for the current topic, for example as sketched below.
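A rough sketch of how to add partitions to the existing topic (the ZooKeeper address, topic name, and target count of 3 are placeholders):
bin/kafka-topics.sh --alter --zookeeper Zookeeperhost:Port --topic TopicName --partitions 3
Keep in mind that adding partitions changes how keys map to partitions, which matters if your consumers depend on key-based ordering.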
Last thing: is the leader switch causing issues in your clients, or are you only worried about the warnings?
I am under the impression that with two brokers and sync turned on, my Kafka setup should keep working even if one of the brokers fails.
To test it I made a new topic named topicname. Its description is as follows:
Topic:topicname PartitionCount:1 ReplicationFactor:1 Configs:
Topic: topicname Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Then I ran producer.sh and consumer.sh in the following way:
bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9095 sync --topic topicname
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topicname --from-beginning
As long as both brokers were working, I saw that messages were being received properly by the consumer, but when I killed one of the broker instances with the kill command, the consumer stopped showing me any new messages. Instead it showed the following error message:
WARN [ConsumerFetcherThread-console-consumer-57116_ip-<internalipvalue>-1438604886831-603de65b-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 865; ClientId: console-consumer-57116; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [topicname,0] -> PartitionFetchInfo(9,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.consumer.ConsumerFetcherThread)
[2015-08-03 12:29:36,341] WARN Fetching topic metadata with correlation id 1 for topics [Set(topicname)] from broker [id:0,host:<hostname>,port:9092] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
I had a similar problem; setting the producer config "topic.metadata.refresh.interval.ms" to -1 (or whatever value is suitable for you) solved the issue for me.
In my case, I had 3 brokers (a multi-broker setup on my local machine) and created the topic with 3 partitions and a replication factor of 2.
Test setup:
Before setting the producer config:
With 3 brokers running, I killed one of the brokers after the producer started. The local ZooKeeper updated the ISR and topic metadata (removing the down broker as leader), but the producer did not pick it up (maybe due to the default 10-minute refresh time), so message sends ended up failing with send exceptions.
After setting the producer config (-1 in my case):
With 3 brokers running, I killed one of the brokers after the producer started. The local ZooKeeper updated the ISR info (removing the down broker as leader), the producer refreshed the new ISR/topic metadata, and message sends did not fail.
-1 makes it refresh the topic metadata on each failed attempt, so you may want to reduce the refresh time to something reasonable instead.
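For reference, a minimal sketch of the relevant old (Scala) producer properties for Kafka 0.8.x; the broker list here is an illustrative assumption:
# brokers to bootstrap from (illustrative)
metadata.broker.list=localhost:9092,localhost:9095
serializer.class=kafka.serializer.StringEncoder
# -1 = refresh topic metadata on every failed send instead of the 10-minute default
topic.metadata.refresh.interval.ms=-1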
I think there are two things that can make your consumer stop working after a broker goes down in a Kafka HA cluster:
--replication-factor should be bigger than 1 for your topic, so every topic partition has at least one backup.
The replication factor for Kafka's internal topics should also be bigger than 1 in the broker configuration:
offsets.topic.replication.factor = 3
transaction.state.log.replication.factor = 3
transaction.state.log.min.isr = 2
These two modifications kept my producer and consumer working after broker shutdowns (5 brokers, with every broker going down once).
You can see in the topic description that you posted that your topic has only a single replica.
With a single replica there is no fault tolerance and if broker 0 (the broker that contains the replica) goes away, the topic will be unavailable.
Create a topic with more replicas (with --replication-factor 3) to have fault tolerance in case of crashes.
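For example, a sketch using the two brokers from the question (the replication factor cannot exceed the number of brokers, so 2 here; use 3 if you actually run three brokers):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic topicname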
I had run into the same problem even when using a topic with a replication factor of 2.
Setting the following property on the producer worked for me.
"metadata.max.age.ms". (Kafka-0.8.2.1)
Else, my Producer was waiting for 1 minute by default to fetch the new leader and start contacting it
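For reference, a minimal sketch of the new (Java) producer properties for 0.8.2.x; the broker list and the refresh value are illustrative assumptions:
# brokers to bootstrap from (illustrative)
bootstrap.servers=localhost:9092,localhost:9095
# refresh metadata more aggressively than the default
metadata.max.age.ms=5000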
For a topic with replication factor N, Kafka tolerates up to N-1 server failures. For example, a replication factor of 3 allows you to handle up to 2 server failures.