Kafka: deleting messages from topics with retention "compact" - apache-kafka

I am trying to implement a minimal working example of compacted topics in Kafka with Java. I got compaction working well, but I cannot see deletes happening when I write messages with a key and a null value, as described in the Kafka documentation.
Version of library used: kafka-clients-0.10.0.0.jar
Here is a gist of a Java class reproducing the behaviour:
https://gist.github.com/anonymous/f78184eaeec3ee82b15182aec24a432a
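For context, the "delete" is produced by sending a record with a null value for a key that already exists on the topic. A rough sketch of that part (not the gist itself; the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TombstoneSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A regular record for a key, followed by a tombstone (null value) for the same key.
            producer.send(new ProducerRecord<>("compaction-test", "some-key", "some-value")).get();
            producer.send(new ProducerRecord<>("compaction-test", "some-key", null)).get();
        }
    }
}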
Furthermore, following the documentation, I used the following topic-level configuration so that compaction kicks in as quickly as possible:
min.cleanable.dirty.ratio=0.01
cleanup.policy=compact
segment.ms=100
delete.retention.ms=100
On the server.properties side, just to be sure:
log.retention.check.interval.ms=100
log.cleaner.delete.retention.ms=100
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.01
When run, this class shows that compaction works: there is only ever one message with the same key on the topic. However, I still see the message with the null value, which in my opinion should have been deleted.
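Such a check boils down to a consumer reading the topic back from the beginning, for example like this (a rough sketch with placeholder broker address and group id, not necessarily what the gist does):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CompactionCheckSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "compaction-check");         // placeholder group id
        props.put("auto.offset.reset", "earliest");        // start from the beginning of the topic
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("compaction-test"));
            ConsumerRecords<String, String> records = consumer.poll(5000);
            for (ConsumerRecord<String, String> record : records) {
                // A key that still shows up with a null value means the tombstone is retained.
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}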
I can see the cleaner threads running, producing output like:
[2016-08-11 12:30:21,032] INFO Cleaner 0: Cleaning segment 15 in log compaction-test-0 (last modified Thu Aug 11 12:29:52 CEST 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
Does anyone know why it's "retaining deletes"? Am I missing any relevant configuration option? Am I writing "null" in the correct way?
Any ideas are greatly appreciated. Thanks in advance!
UPDATE: After investigating helpful comments, I upgraded to 0.10.0.1 and found the following output in the cleaner log:
[2016-08-15 12:44:57,412] INFO Cleaner 0: Cleaning log compaction-test-0 (discarding tombstones prior to Mon Aug 15 12:44:40 CEST 2016)... (kafka.log.LogCleaner)
[2016-08-15 12:44:57,412] INFO Cleaner 0: Cleaning segment 0 in log compaction-test-0 (last modified Mon Aug 15 12:44:41 CEST 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2016-08-15 12:44:57,412] INFO Cleaner 0: Cleaning segment 15 in log compaction-test-0 (last modified Mon Aug 15 12:44:41 CEST 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
[2016-08-15 12:44:57,413] INFO Cleaner 0: Cleaning segment 16 in log compaction-test-0 (last modified Mon Aug 15 12:44:56 CEST 2016) into 0, retaining deletes. (kafka.log.LogCleaner)
As "retaining deletes" is set by
val retainDeletes = old.lastModified > deleteHorizonMs
and the last modification date of the segment in question always seems to be slightly later than the delete horizon, deletion never happens in my minimal example.
Now I'm just wondering how to adjust the settings or the test to deal with this...

This problem has been fixed in 0.10.1. See this JIRA: https://issues.apache.org/jira/browse/KAFKA-4015

Related

Recovering Kafka Cluster from a disk full error

We have a 3-node Kafka cluster. For data storage we have two mounted disks, /data/disk1 and /data/disk2, on each of the 3 nodes. The log.dirs setting in kafka.properties is:
log.dirs=/data/disk1/kafka-logs,/data/disk2/kafka-logs
It so happened that on one of the nodes, Node1, the disk partition holding /data/disk2/kafka-logs became 100% full.
The reason this happened is that we were replaying data from Filebeat into a Kafka topic, and a lot of data got pushed in a very short time. I temporarily changed the retention for that topic from 7 days to 1 day, and the topic size is back to normal.
The problem is that on Node1, which has /data/disk2/kafka-logs 100% full, the Kafka process just won't start and emits the following error:
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,093] INFO Recovering unflushed segment 0 in log my-topic-0. (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,094] INFO Completed load of log my-topic-0 with 1 log segments and log end offset 0 in 2 ms (kafka.log.Log)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,095] ERROR There was an error in one of the threads during logs loading: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code (kafka.log.LogManager)
Jul 08 12:03:29 broker01 kafka[23949]: [2019-07-08 12:03:29,101] FATAL [Kafka Server 1], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
Jul 08 12:03:29 broker01 kafka[23949]: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
Jul 08 12:03:29 broker01 kafka[23949]: at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.loadRecord(FileLogInputStream.java:135)
Jul 08 12:03:29 broker01 kafka[23949]: at org.apache.kafka.common.record.FileLogInputStream$FileChannelLogEntry.record(FileLogInputStream.java:149)
Jul 08 12:03:29 broker01 kafka[23949]: at kafka.log.LogSegment.$anonfun$recover$1(LogSegment.scala:22
The replication factor for most topics is either 2 or 3. So, I'm wondering if I can do the following:
1. Change the replication factor to 2 for all the topics (Node 2 and Node 3 are running fine).
2. Delete some data from Node1.
3. Restart Node1.
4. Change the replication factor back to 2 or 3, as it was initially.
Does anyone know of a better way or a better suggestion?
Update: Steps 1 and 4 are not needed. Just steps 2 and 3 are enough if you have replicas.
Your problem (and, accordingly, its solution) is similar to the one described in this question: kafka 0.9.0.1 fails to start with fatal exception
The easiest and fastest way is to delete part of the data. When the broker starts, the data will be replicated again according to the new retention.
So, I'm wondering if I can do the following...
Answering your question specifically: yes, you can do the steps you described in sequence, and this will help return the cluster to a consistent state.
To prevent this from happening in the future, you can try using the log.retention.bytes parameter instead of log.retention.hours, although I believe a size-based retention policy is not the best choice for logs, because in my experience you usually need to know for at least how long a topic's data will be stored in the cluster.

JBoss Fuse server not starting, giving the error below

I am getting the error below in the Fuse log:
Sep 14, 2017 5:21:57 AM org.apache.karaf.main.SimpleFileLock lock
INFO: locking
Sep 14, 2017 5:21:58 AM org.apache.karaf.main.SimpleFileLock lock
INFO: locking
Sep 14, 2017 5:21:59 AM org.apache.karaf.main.SimpleFileLock lock
INFO: locking
Sep 14, 2017 5:22:00 AM org.apache.karaf.main.SimpleFileLock lock
INFO: locking
Usually this happens because there is already another Fuse instance running, with the same current directory. If you started Fuse in a way that doesn't create a console, it's not difficult to forget that you did so, and try to start it again. Fuse is designed to work this way -- it's not an error to try to start multiple instances with the same working directory: this is often done for redundancy. But it can be hard to spot if you've done it by accident. Use "jps" or something to see if another JVM is running.
There are other potential causes of this problem that can be explored, but it's worth ruling out the simple explanations first.

lmtpd: failed to mmap file /var/lib/imap/deliver.db.NEW (in reply to end of DATA command)

Good day!
After installing and starting Kolab, mail was delivered instantly. But after a few days, messages to local destinations started being delivered with a delay. They are eventually delivered, but the delay can be several hours. An example of the path of one message:
root@myhost:~# cat /var/log/mail.log | grep 7AA7935B1FC
Jan 12 11:31:03 myhost postfix/smtpd[19494]: 7AA7935B1FC: client=localhost[127.0.0.1]
Jan 12 11:31:05 myhost postfix/cleanup[19492]: 7AA7935B1FC: message-id=<20160112093103.7AA7935B1FC@mail.myhost.com>
Jan 12 11:31:05 myhost postfix/qmgr[7021]: 7AA7935B1FC: from=<noreply@myhost.com>, size=1279, nrcpt=3 (queue active)
Jan 12 11:31:05 myhost lmtpunix[19631]: Delivered: <20160112093103.7AA7935B1FC@mail.myhost.com> to mailbox: myhost.com!user.user1
Jan 12 11:31:06 myhost postfix/lmtp[19617]: 7AA7935B1FC: to=<user1@myhost.com>, relay=mail.myhost.com[/var/lib/imap/socket/lmtp], delay=2.6, delays=2/0.01/0/0.59, dsn=4.3.0, status=deferred (host mail.myhost.com[/var/lib/imap/socket/lmtp] said: 421 4.3.0 lmtpd: failed to mmap /var/lib/imap/deliver.db.NEW file (in reply to end of DATA command))
Jan 12 11:31:06 myhost postfix/lmtp[19617]: 7AA7935B1FC: to=<user2@myhost.com>, relay=mail.myhost.com[/var/lib/imap/socket/lmtp], delay=2.7, delays=2/0.01/0/0.68, dsn=4.4.2, status=deferred (lost connection with mail.myhost.com[/var/lib/imap/socket/lmtp] while sending end of data -- message may be sent more than once)
Jan 12 11:31:07 myhost postfix/lmtp[19617]: 7AA7935B1FC: to=<user3@myhost.com>, relay=mail.myhost.com[/var/lib/imap/socket/lmtp], delay=2.7, delays=2/0.01/0/0.68, dsn=4.4.2, status=deferred (lost connection with mail.myhost.com[/var/lib/imap/socket/lmtp] while sending end of data -- message may be sent more than once)
Currently mailq shows a number of messages in the queue. An example of one of them:
7BBDF35B123 6162 Tue Jan 12 13:19:24 user@rambler.ru (delivery temporarily suspended: lost connection with mail.myhost.com[/var/lib/imap/socket/lmtp] while sending end of data -- message may be sent more than once) user4@myhost.com
-- 11667 Kbytes in 327 Requests.
I think that the main reason is described here:
lmtp: failed to mmap /var/lib/imap/deliver.db.NEW file
But, unfortunately, I have not been able to find a solution there.
The problem was solved according to this recommendation: http://lists.kolab.org/pipermail/users-de/2015-May/001998.html
Stop the cyrus-imap and postfix services.
Delete the files deliver.db.NEW and deliver.db in the directory /var/lib/imap/.
Start the services; the file deliver.db is created automatically.
Requeue the messages: postsuper -r ALL
After that, some of the messages were delivered from the queue again.
Proposed cause: after installing and starting the services on the new server, users mass-imported messages in *.eml format, downloaded from the previous mail system. Perhaps these actions somehow overflowed the index files.
P.S.: Unfortunately, the solution turned out to be only temporary: the situation described above recurs periodically :(

Cannot retrieve Kafka offset for a topic/group/partition from ZooKeeper

We are running ZooKeeper 3.3 and Kafka 0.8 on separate servers. We are using high-level (HL) consumers, and they access the data in the Kafka queue as expected; on a restart they pick up from where they left off. So the consumers behave as expected.
The problem is that we can't see the offsets in ZooKeeper when we use zkCli.sh. For now the consumer is set up to run against only one partition of a topic.
The command "ls /consumers/mygrpid/offsets/mytopic/0" returns [].
The same goes for "ls /consumers/mygrpid/owners/mytopic"; it also returns [].
Because the consumer behaves as expected when it is stopped and restarted (i.e. it picks up from the offset it left off at the last time it ran; we can tell this by looking at the log, which shows the offset it starts with and every offset it commits), we know that ZooKeeper must be saving the committed offsets for the consumer somewhere. My understanding is that ZooKeeper keeps track of the offsets for the HL consumer, not the Kafka broker. Yet the "ls" commands that are supposed to show the offsets show null instead.
Should I be looking at a different place for accessing the offsets? (ultimately, I need to have a script that reports on the offsets for all the consumers.)
Very much appreciate any help or suggestion.
You should use get instead of ls. ls lists child nodes, and in your case /consumers/mygrpid/offsets/mytopic/0 does not have any children. But it does have a value, so running get /consumers/mygrpid/offsets/mytopic/0 should show you something like this:
47
cZxid = 0x568
ctime = Tue Feb 03 19:08:10 EET 2015
mZxid = 0x568
mtime = Tue Feb 03 19:08:10 EET 2015
pZxid = 0x568
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 2
numChildren = 0
where 47 is the offset value.
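Since you ultimately need a script that reports the offsets for all consumers, you can also read the same znodes programmatically. Here is a rough sketch using the plain ZooKeeper Java client (the connect string is a placeholder, and it assumes the default /consumers/<group>/offsets/<topic>/<partition> layout used by the high-level consumer):

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ConsumerOffsetReport {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper; adjust the connect string to your ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });
        try {
            for (String group : zk.getChildren("/consumers", false)) {
                String offsetsPath = "/consumers/" + group + "/offsets";
                if (zk.exists(offsetsPath, false) == null) {
                    continue; // this group has no committed offsets
                }
                for (String topic : zk.getChildren(offsetsPath, false)) {
                    for (String partition : zk.getChildren(offsetsPath + "/" + topic, false)) {
                        byte[] data = zk.getData(offsetsPath + "/" + topic + "/" + partition, false, null);
                        System.out.println(group + " " + topic + " " + partition + " "
                                + new String(data, StandardCharsets.UTF_8));
                    }
                }
            }
        } finally {
            zk.close();
        }
    }
}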

MongoDB index creation goes past 100% and appears to loop forever

I have a MongoDB collection with ~5.5M records. My attempts to index it, whether on a single field or with a compound index, fail: the indexing process proceeds normally, but when it reaches 100%, where I presume it should stop, it goes past 100% and just continues on. I've left it running for 10 hours but it never finished.
The fields I try to index on are longs or doubles.
I'm running the latest MongoDB version on x64 Windows.
Am I right to think that this is abnormal behaviour? Any ideas what I can do?
Wed Sep 05 10:22:37 [conn1] 415000000/5576219 7442%
Wed Sep 05 10:22:48 [conn1] 417000000/5576219 7478%
Wed Sep 05 10:22:59 [conn1] 419000000/5576219 7514%
Per helpful advice from the mongodb-users list:
This was likely due to running out of disk space, which corrupted the database.
What I did was clear up disk space, then run "mongodump --repair" and then "mongorestore".