I run MirrorMaker 2 with the high-level driver, as documented here, running ./bin/connect-mirror-maker.sh mm2.properties in 3 pods of a Kubernetes deployment.
The mm2.properties file looks like this:
clusters = source, dest
source.bootstrap.servers = ***:9092
dest.bootstrap.servers = ***:9092
source->dest.enabled = true
dest->source.enabled = false
source->dest.topics = event\.PROD\.some_id.*
replication.factor=3
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
sync.topic.acls.enabled = false
This works fine, with all topics matching the event\.PROD\.some_id.* regex being replicated.
Now, when I need to add another topic to the whitelist, I expected to be able to simply scale everything down, update the regex, and scale everything up again.
When I update the whitelist regex to source->dest.topics = event\.PROD\.(some_id|another_id).* , the topics matching "another_id" are created in the dest cluster, but no data is replicated, and MirrorMaker seems to be stuck just committing offsets:
[2020-05-28 20:33:19,496] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:424)
[2020-05-28 20:33:19,496] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:441)
[2020-05-28 20:33:19,499] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} Finished commitOffsets successfully in 3 ms (org.apache.kafka.connect.runtime.WorkerSourceTask:523)
Is this a limitation of the high level driver, or am I doing something wrong? From my understanding, being able to dynamically add topics to the whitelist was one of the motivations for MM2.
I am playing with MM2 as well. Can you try setting these configurations? I had to enable the sync.topic.configs.enabled parameter so my MM2 instance would detect the new topics and their data.
refresh.topics.enabled = true
sync.topic.configs.enabled = true
refresh.topics.interval.seconds = 10
P.S. I am adding my reply as an answer because I wanted to paste some configs.
I am attempting to use MirrorMaker 2 to replicate data between AWS Managed Kafka (MSK) clusters in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying to replicate topics just one way, from EU -> NA.
I am starting a MirrorMaker Connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included below).
This works fine for topics with small messages on them, but one of my topics has binary messages up to 20MB in size. When I try to replicate that topic I get an error every 30 seconds:
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop constantly disconnecting with request timeout every 30s and then trying again.
Looking at this, I suspect that the problem is that request.timeout.ms is left at the default (30s) and the consumer times out trying to read the topic with many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to attempt to configure the consumer properties. However, no matter what I set, the timeout for the consumer remains fixed at the default, confirmed both by Kafka outputting its config in the log and by timing how long passes between the disconnect messages. E.g. I set:
CLOUD_EU.consumer.request.timeout.ms=120000
In the properties that I start MM-2 with.
Based on various guides I have found while looking at this, I have also tried:
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of which have worked.
How can I change the consumer request.timeout.ms setting? The log is approximately 10,000 lines long, but everywhere the ConsumerConfig is logged it shows request.timeout.ms = 30000.
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster overrides
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000
We have an issue where, when Kafka brokers must be taken offline, no consumer service has any idea about it and they all keep running.
We tried listing consumers on the new Kafka instance and saw no existing consumers listed there. All consumers listed are newly created ones.
We had to terminate all existing consumer services manually, which is not convenient every time we hit this issue.
Question - How does a consumer know it is no longer listed in the Kafka cluster so it should terminate itself?
P.S. We use Spring Kafka.
1 -- How to check cluster & replica status
Check the status of all brokers in the Kafka cluster:
$ zookeeper-shell.sh localhost:9001 ls /brokers/ids
Check the status of a specific broker:
$ zookeeper-shell.sh localhost:9001 get /brokers/ids/<id>
Check for any partition replicas that are not available (the replica_unavailability check):
$ kafka-check --cluster-type=sample_type replica_unavailability
Run the same check against the first broker only:
$ kafka-check --cluster-type=sample_type --broker-id 3 replica_unavailability --first-broker-only
Check for offline partitions:
$ kafka-check --cluster-type=sample_type offline
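As a complement to the shell checks above - and since the question is about the consumer services noticing the outage themselves - here is a minimal Java sketch (the bootstrap address and timeouts are assumptions, adapt as needed) that probes the cluster with AdminClient and lets a service decide to shut itself down when no broker answers:
// Minimal sketch: probe the cluster with AdminClient; bootstrap address and timeouts are placeholders.
import java.util.Collection;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ClusterHealthProbe {

    public static boolean brokersReachable() {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");
        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() returns the brokers the client can currently see
            Collection<Node> nodes = admin.describeCluster().nodes().get(5, TimeUnit.SECONDS);
            return nodes != null && !nodes.isEmpty();
        } catch (Exception e) {
            return false; // timeout or connection failure: treat the cluster as gone
        }
    }

    public static void main(String[] args) {
        if (!brokersReachable()) {
            System.err.println("No Kafka brokers reachable - shutting down");
            System.exit(1);
        }
    }
}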
2 -- Code sample to send a kill-message and auto-shutdown
There are two custom options to handle the shutdown with a kill-message: do it gracefully by sending the kill-message before taking down brokers or topics.
Option 1: use an in-band message/signal - i.e. send a "kill" message on the topics the consumer is already listening to, so it arrives in offset order on each topic-partition.
Option 2: make the consumer listen to two topics, e.g. "topic" and "topic_kill".
The difference between the two options is that in the first version the kill-message arrives in the order it was sent, so, depending on your implementation, there may be blocking messages waiting to be consumed before that kill-message.
The second version lets the kill-signal arrive out of band, independently and without being blocked; this is a nicer, reusable architectural pattern, with a clear separation between the data topic and signaling.
Code sample: a) a producer sending the kill-message and b) a consumer to receive it and handle the shutdown.
# Producer -- modify and adapt as needed
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['0.0.0.0:<my port number>'],
                         key_serializer=lambda m: m.encode('utf8'),
                         value_serializer=lambda m: json.dumps(m).encode('utf8'))

def send_kill(topic: str, partitions: list):
    # send the kill-message to every partition so all consumers in the group see it
    for p in partitions:
        producer.send(topic, key='kill', partition=p)
    producer.flush()
# Consumer to accept a kill-message -- please modify and adapt as needed
import json
from kafka import KafkaConsumer
from kafka.structs import OffsetAndMetadata, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=['0.0.0.0:<my port number>'],
                         key_deserializer=lambda m: m.decode('utf8'),
                         value_deserializer=lambda m: json.loads(m.decode('utf8')),
                         auto_offset_reset="earliest",
                         group_id='1')
consumer.subscribe(['topic'])

for msg in consumer:
    tp = TopicPartition(msg.topic, msg.partition)
    # commit the offset of the *next* message so this one is not re-read after a restart
    offsets = {tp: OffsetAndMetadata(msg.offset + 1, '')}
    if msg.key == "kill":
        consumer.commit(offsets=offsets)
        consumer.unsubscribe()
        exit(0)
    # do your work...
    consumer.commit(offsets=offsets)
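Since the question mentions Spring Kafka, here is a rough Java sketch of option 2 in spring-kafka terms: a second @KafkaListener on the kill topic stops the main listener container. The listener ids, topic names and the final shutdown action are placeholders, not part of the original sample - adapt as needed.
// Rough Spring Kafka sketch of option 2 (listener ids and topic names are placeholders).
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.stereotype.Component;

@Component
public class KillSignalListener {

    private final KafkaListenerEndpointRegistry registry;

    public KillSignalListener(KafkaListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    // main data listener; the id lets us look its container up later
    @KafkaListener(id = "dataListener", topics = "topic")
    public void onData(ConsumerRecord<String, String> record) {
        // normal processing of the data topic...
    }

    // out-of-band kill topic
    @KafkaListener(id = "killListener", topics = "topic_kill")
    public void onKill(ConsumerRecord<String, String> record) {
        MessageListenerContainer data = registry.getListenerContainer("dataListener");
        if (data != null) {
            data.stop(); // stop consuming the data topic
        }
        // optionally terminate the whole service here (e.g. SpringApplication.exit(...))
    }
}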
We are testing a DR scenario for Kafka. We have 2 Kafka clusters in separate regions and are using MirrorMaker 2 to replicate the topics and messages.
Topics and messages are replicated fine, but we are observing that consumer offsets are not replicated.
E.g.:
produced 10 messages from a producer pointed at Kafka region 1
consumed 5 messages with a consumer pointed at Kafka region 1
stopped the consumer pointed at region 1
started a consumer pointed at region 2
consumed the messages
The expectation here is that the region 2 consumer should consume from offset 6, but it starts consuming from offset 0.
Below is the properties file:
clusters = primary, secondary
# primary cluster information
primary.bootstrap.servers = test1-primary.com:9094,test2-primary.com.apttuscloud.io:9094,test3-primary.com:9094
primary.security.protocol= SASL_SSL
primary.ssl.truststore.password= dummypassword
primary.ssl.truststore.location= /opt/bitnami/kafka/config/certs/kafka.truststore.jks
primary.ssl.keystore.password= dummypassword
primary.ssl.keystore.location= /opt/bitnami/kafka/config/certs/kafka.keystore.jks
primary.ssl.endpoint.identification.algorithm=
primary.sasl.mechanism= PLAIN
primary.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="dummyuser" password="dummypassword";
# secondary cluster information
secondary.bootstrap.servers = test1-secondary.com:9094,test2-secondary.com.apttuscloud.io:9094,test3-secondary.com:9094
secondary.security.protocol= SASL_SSL
secondary.ssl.truststore.password= dummypassword
secondary.ssl.truststore.location= /opt/bitnami/kafka/config/certs/kafka.truststore.jks
secondary.ssl.keystore.password= dummypassword
secondary.ssl.keystore.location= /opt/bitnami/kafka/config/certs/kafka.keystore.jks
secondary.ssl.endpoint.identification.algorithm=
secondary.sasl.mechanism=PLAIN
secondary.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="dummyuser" password="dummypassword";
# Topic Configuration
primary->secondary.enabled = true
primary->secondary.topics = .*
secondary->primary.enabled = true
secondary->primary.topics = .*
############################# Internal Topic Settings #############################
# The replication factor for mm2 internal topics "heartbeats", "B.checkpoints.internal" and
# "mm2-offset-syncs.B.internal"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3
checkpoints.topic.replication.factor= 3
heartbeats.topic.replication.factor= 3
offset-syncs.topic.replication.factor= 3
# The replication factor for connect internal topics "mm2-configs.B.internal", "mm2-offsets.B.internal" and
# "mm2-status.B.internal"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
replication.factor = 3
refresh.topics.enabled = true
sync.topic.configs.enabled = true
refresh.topics.interval.seconds = 10
topics.blacklist = .*[\-\.]internal, .*\.replica, __consumer_offsets
groups.blacklist = console-consumer-.*, connect-.*, __.*
primary->secondary.emit.heartbeats.enabled = true
primary->secondary.emit.checkpoints.enabled = true
Please note that some confidential values have been replaced with dummy values.
Regards,
Narendra Jadhav
With MirrorMaker 2.5, when moving consumers between clusters, offsets are not automatically translated.
So when starting consumers on the other cluster, they need to use RemoteClusterUtils.translateOffsets() to find their offsets in that cluster.
In 2.7 (expected November 2020), you can have MirrorMaker 2 automatically translate offsets, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
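For illustration, a minimal sketch of the 2.5 approach (the group id, bootstrap server and omitted security settings below are placeholders based on the question's config): translate the group's committed offsets from the primary cluster alias and commit them on the secondary cluster before starting the consumers there.
// Minimal sketch for MirrorMaker 2.5: translate offsets from "primary" and seed
// the consumer group on the secondary cluster. Group id and servers are placeholders.
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class TranslateOffsetsExample {
    public static void main(String[] args) throws Exception {
        // Connection config for the target (secondary) cluster, where the
        // primary.checkpoints.internal topic lives; add the SASL_SSL settings as needed.
        Map<String, Object> targetProps = new HashMap<>();
        targetProps.put("bootstrap.servers", "test1-secondary.com:9094");

        // Translate the committed offsets of the group from the "primary" cluster alias.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(targetProps, "primary",
                        "my-consumer-group", Duration.ofSeconds(30));

        // Commit the translated offsets on the secondary cluster so the group
        // resumes from there instead of offset 0.
        Map<String, Object> consumerProps = new HashMap<>(targetProps);
        consumerProps.put("group.id", "my-consumer-group");
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.commitSync(translated);
        }
    }
}
Note that with the default replication policy the replicated topics on the secondary cluster are prefixed with the source alias (primary.<topic>), and the translated offsets refer to those prefixed topic names.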
I've been working with KSQL for quite some time. The Kafka cluster has 3 nodes. I've been using a UDF as well, and all looks good until I stop the servers and start them again.
On server start I'm seeing the following in the logs:
[2019-04-03 11:29:54,381] ERROR Exception encountered running command: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).. Retrying in 5000 ms (io.confluent.ksql.util.RetryUtil:80)
[2019-04-03 11:29:54,381] ERROR Stack trace: io.confluent.ksql.exception.KafkaTopicExistsException: A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required. KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1).
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:51)
at io.confluent.ksql.services.TopicValidationUtil.validateTopicProperties(TopicValidationUtil.java:35)
at io.confluent.ksql.services.KafkaTopicClientImpl.validateTopicProperties(KafkaTopicClientImpl.java:292)
at io.confluent.ksql.services.KafkaTopicClientImpl.createTopic(KafkaTopicClientImpl.java:76)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.createSinkTopic(KsqlStructuredDataOutputNode.java:244)
at io.confluent.ksql.planner.plan.KsqlStructuredDataOutputNode.buildStream(KsqlStructuredDataOutputNode.java:146)
at io.confluent.ksql.physical.PhysicalPlanBuilder.buildPhysicalPlan(PhysicalPlanBuilder.java:106)
at io.confluent.ksql.QueryEngine.buildPhysicalPlan(QueryEngine.java:113)
at io.confluent.ksql.KsqlEngine$EngineExecutor.execute(KsqlEngine.java:625)
at io.confluent.ksql.KsqlEngine$EngineExecutor.access$800(KsqlEngine.java:577)
at io.confluent.ksql.KsqlEngine.execute(KsqlEngine.java:247)
at io.confluent.ksql.rest.server.computation.StatementExecutor.startQuery(StatementExecutor.java:277)
at io.confluent.ksql.rest.server.computation.StatementExecutor.executeStatement(StatementExecutor.java:191)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleStatementWithTerminatedQueries(StatementExecutor.java:167)
at io.confluent.ksql.rest.server.computation.StatementExecutor.handleRestore(StatementExecutor.java:101)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$null$0(CommandRunner.java:139)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:63)
at io.confluent.ksql.util.RetryUtil.retryWithBackoff(RetryUtil.java:36)
at io.confluent.ksql.rest.server.computation.CommandRunner.lambda$processPriorCommands$1(CommandRunner.java:135)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.confluent.ksql.rest.server.computation.CommandRunner.processPriorCommands(CommandRunner.java:134)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:414)
at io.confluent.ksql.rest.server.KsqlServerMain.createExecutable(KsqlServerMain.java:80)
at io.confluent.ksql.rest.server.KsqlServerMain.main(KsqlServerMain.java:42)
(io.confluent.ksql.util.RetryUtil:84)
Though I've stopped/terminated all the queries, the log prints all the commands I've executed from the beginning of my testing till date, including create, select and drop. I've pulled the .jar (UDF) out of the /ext folder and the server started, though the log then prints that the UDF function I'm using is not available.
This is my ksql-server.properties:
bootstrap.servers=hostname:9092
service.id=cyan_ksql
commit.interval.ms=5000
cache.max.bytes.buffering=20000000
num.stream.threads=10
fail.on.deserialization.error=false
listeners=http://localhost:8088
ksql.extension.dir=/opt/ksql-master/ext/
I'm going nuts with this error. I'm deleting the topic and somehow it's recreated. Someone please help.
Check out the error:
A Kafka topic with the name 'czxcorp-structured-data-enriched' already exists, with different partition/replica configuration than required.
KSQL expects 4 partitions (topic has 9), and 1 replication factor (topic has 1)
If you've deleted the topic then either:
it didn't actually get deleted
it got deleted and something else recreated it with nine partitions, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) to the default four
another KSQL command is creating it ahead of the one that errors out, and your erroring KSQL query has not specified an override (WITH (PARTITIONS=9)) to the default four
If you want to blow away your state and start from scratch, simply change your ksql.service.id, which will cause KSQL to use a new command topic (which is what gets replayed when you restart the process).
I have two kinds of log entries in server.log
First kind:
WARN Resetting first dirty offset of __consumer_offsets-6 to log start offset 918 since the checkpointed offset 903 is invalid. (kafka.log.LogCleanerManager$)
Second kind:
INFO [TransactionCoordinator id=3] Initialized transactionalId Source: AppService Kafka consumer -> Not empty string filter -> CDMEvent mapper -> (NonNull CDMEvent filter -> Map -> Sink: Kafka CDMEvent producer, Nullable CDMEvent filter -> Map -> Sink: Kafka Error producer)-bddeaa8b805c6e008c42fc621339b1b9-2 with producerId 78004 and producer epoch 23122 on partition __transaction_state-45 (kafka.coordinator.transaction.TransactionCoordinator)
I have found a suggestion that mentions that removing the checkpoint file might help:
https://medium.com/@anishekagarwal/kafka-log-cleaner-issues-80a05e253b8a
"What we gathered was to:
stop the broker
remove the log cleaner checkpoint file
( cleaner-offset-checkpoint )
start the broker
that solved the problem for us."
Is it safe to try that with all checkpoint files (cleaner-offset-checkpoint, log-start-offset-checkpoint, recovery-point-offset-checkpoint, replication-offset-checkpoint) or is it not recommendable at all with any of them?
I have stopped each broker, moved cleaner-offset-checkpoint to a backup location and started it without that file; the brokers started neatly, deleted a lot of excessive segments, and they no longer log:
WARN Resetting first dirty offset of __consumer_offsets to log start offset since the checkpointed offset is invalid
any more. Obviously, this issue/defect https://issues.apache.org/jira/browse/KAFKA-6266 is not solved yet, even in 2.0. However, that didn't compact the consumer offsets according to expectations: offsets.retention.minutes defaults to 10080 (7 days), and I tried to set it explicitly to 5040, but it didn't help. There are still messages more than one month old; since log.cleaner.enable is true by default, they should be compacted, but they are not. The only remaining option is to set cleanup.policy back to delete for the __consumer_offsets topic, but that is the action that triggered the problem in the first place, so I am a bit reluctant to do that. The problem that I described in "No Kafka Consumer Group listed by kafka-consumer-groups.sh" is also not resolved by this; obviously something prevents kafka-consumer-groups.sh from reading the __consumer_offsets topic (when issued with the --bootstrap-server option, otherwise it reads it from ZooKeeper) and displaying results. Kafka Tool does that without problems, and I believe these two problems are connected.
And the reason why I think that topic is not compacted is that it has messages with exactly the same key (and even timestamp) that are older than they should be according to the broker settings. Kafka Tool also ignores certain records and doesn't interpret them as consumer groups in that display. Why kafka-consumer-groups.sh ignores them all is probably due to some corruption of these records.