Using Debezium MySQL connector with MSK shows "INVALID_REPLICATION_FACTOR" - apache-kafka

I'm using Debezium MySQL with MSK, very simple setup. Connection to MySQL (Aurora) is tested fine. Kafka topics creation, listing are both ok.
However, when I run the connector, after a lot of scrolling info I get
WARN [Producer clientId=xxx] Error while fetching metadata with correlation id 1 : {xxx.xxx=INVALID_REP
LICATION_FACTOR} (org.apache.kafka.clients.NetworkClient:1100)
A lot of them keeps showing up and the connector does not work properly.

After quite a while I found out the reason for this is that default replication factor setting on MSK follows Kafka best practice which is 3, but I only created 2 brokers.
The configuration stayed 3 and when connector tried to auto-create a topic with 3 replicas it fails. The strange thing is even when I manually created a topic with replication factor 2, the connector would throw the very same warning.
It seems that the internal topics are always attempted by Debezium connector.
Creating a new revision and set replication factor as 2 solved the problem.

Related

Using Amazon MSK and Debezium SQL Server Connector. Error while fetching metadata with correlation id 7 : {TestKafkaDB= UNKNOWN_TOPIC_OR_PARTITION}

I was trying to connect my RDS MS SQL server with Debezium SQL Server Connector to stream changes to Kafka Cluster on Amazon MSK.
I configured connector and Kafka Connect worker well run the Connect by
bin/connect-standalone.sh ../worker.properties connect/dbzmmssql.properties
Got WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 10 : {TestKafkaDB=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1031)
I've solved this problem and just want to share my possible solution with other fresher with Kafka.
TestKafkaDB=UNKNOWN_TOPIC_OR_PARTITION basically means the connector didn't find the a usable topic in Kafka broker. The reason I am facing this is the Kafka broker didn't automatically create a new topic for the stream.
To solving this, I changed Cluster Configuration in AWS MSK console, change auto.create.topics.enable from default false to true and update this configuration to the Cluster, then my problem solved.

Running two instances of MirrorMaker 2.0 halting data replication for newer topics

We tried below scenario using mirror-maker 2.0 and want to know if output of second scenario is expected.
Scenario 1.) We ran single mirror-maker 2.0 instance using the below properties and start command.
clusters=a,b
tasks.max=10
a.bootstrap.servers=kf-test-cluster-a:9092
a.config.storage.replication.factor=1
a.offset.storage.replication.factor=1
a.security.protocol=PLAINTEXT
a.status.storage.replication.factor=1
b.bootstrap.servers=kf-test-cluster-b:9092
b.config.storage.replication.factor=1
b.offset.storage.replication.factor=1
b.security.protocol=PLAINTEXT
b.status.storage.replication.factor=1
a->b.checkpoints.topic.replication.factor=1
a->b.emit.checkpoints.enabled=true
a->b.emit.hearbeats.enabled=true
a->b.enabled=true
a->b.groups=group1|group2|group3
a->b.heartbeats.topic.replication.factor=1
a->b.offset-syncs.topic.replication.factor=1
a->b.refresh.groups.interval.seconds=30
a->b.refresh.topics.interval.seconds=10
a->b.replication.factor=2
a->b.sync.topic.acls.enabled=false
a->b.topics=.*
Start command: /usr/bin/connect-mirror-maker.sh connect-mirror-maker.properties &
Verification: Created new topic "test" on source cluster(a), produced data to topic on source cluster and ran consumer on target-cluster(b),topic "a.test" to verify data replication.
Observation: Worked fine as expected.
Scenario 2.) Ran one more instance of MirrorMaker 2.0 using the same properties as mentioned above.
Start command: /usr/bin/connect-mirror-maker.sh connect-mirror-maker.properties &
Verification: Created one more "test2" topic on source cluster, produced data to topic on source cluster and ran consumer on target-cluster(b),topic "a.test2" to verify data replication.
Observation: MM2 was able to replicate the topic on the target cluster, a.test2 was present on target cluster b but consumer didn't get any record to consume.
On newer mirror-maker 2.0 instance logs, after topic replication, mirror-sourceconnector task had not restarted which was restarting in single instance after topic replication.
NOTE: There were no error logs seen.
I observed the same behavior, your messages are most likely replicated, you can verify this by checking your consumer group offset, the problem is most likely your lag offset is 0 meaning your consumer assumes all previous messages have been consumed. You can reset the offset or read from beginning.
Ideally, the checkpoint heartbeat should contain the latest offset but I currently find this to be empty even though starting with Kafka 2.7, checkpoint heartbeat replication should be automatic

INVALID_FETCH_SESSION_EPOCH errors with confluent HDFS sink connector

I have been trying to get the confluent HDFS sink connector to work, without success. I have a distributed kafka connect cluster (3 nodes) with version 5.2.2 of the confluent platform, talking to a 5-node kafka cluster running kafka 2.3.0.
I have an event generator writing events to a topic with 3 partitions and replication factor 1. I see the events appearing in HDFS, and everything looks correct.
However, for topics with a different number of partitions, or a different replication factor, nothing is written to HDFS. Instead I get errors that look like this:
Node 121 was unable to process the fetch request with (sessionId=378162138, epoch=7242): INVALID_FETCH_SESSION_EPOCH. (org.apache.kafka.clients.FetchSessionHandler:381)
These events appear in the kafka connect logs and it looks like one of these appears for each attempt to commit.
For a topic with 3 partitions, and replication factor 1, everything works.
For a topic with 3 partitions, and replication factor 2, nothing is written to HDFS.
For a topic with 50 partitions, and replication factor 1, nothing is written.
In fact, if I increase either the replication factor or the number of partitions, nothing is written to HDFS.
I can't seem to find the right configuration to make this work. What settings should I be paying attention to?

Apache kafka production cluster setup problems

We have been trying to set up a production level Kafka cluster in AWS Linux machines and till now we have been unsuccessful.
Kafka version:
2.1.0
Machines:
5 r5.xlarge machines for 5 Kafka brokers.
3 t2.medium zookeeper nodes
1 t2.medium node for schema-registry and related tools. (a Single instance of each)
1 m5.xlarge machine for Debezium.
Default broker configuration :
num.partitions=15
min.insync.replicas=1
group.max.session.timeout.ms=2000000
log.cleanup.policy=compact
default.replication.factor=3
zookeeper.session.timeout.ms=30000
Our problem is mainly related to huge data.
We are trying to transfer our existing tables in kafka topics using debezium. Many of these tables are quite huge with over 50000000 rows.
Till now, we have tried many things but our cluster fails every time with one or more reasons.
ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/__consumer_offsets/partitions/0/state
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)..
Error 2:
] INFO [Partition xxx.public.driver_operation-14 broker=3] Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,551] INFO [Partition xxx.public.hub-14 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.hub-14 broker=3] Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.field_data_12_2018-7 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
Error 3:
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=888665879, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
Some more errors :
Frequent disconnections among broker which probably is the reason
behind nonstop shrinking and expanding ISRs with no auto recovery.
Schema registry gets timed out. I don't know how is schema registry even affected. I don't see too much load on that server. Am I missing something? Should I use a Load balancer for multiple instances of schema Registry as failover?. The topic __schemas has just 28 messages in it.
The exact error message is RestClientException: Register operation timed out. Error code: 50002
Sometimes the message transfer rate is over 100000 messages per second, sometimes it drops to 2000 messages/second? message size could cause this?
In order to solve some of the above problems, we increased the number of brokers and increased zookeeper.session.timeout.ms=30000 but I am not sure if it actually solved the our problem and if it did, how?.
I have a few questions:
Is our cluster good enough to handle this much data.
Is there anything obvious that we are missing?
How can I load test my setup before moving to the production level?
What could cause the session timeouts between brokers and the schema registry.
Best way to handle the schema registry problem.
Network Load on one of our Brokers.
Feel free to ask for any more information.
Please Use The latest official version of the Confluent for you cluster.
Actually you can make it better by increasing the number of partitions of your topics and also increasing the tasks.max(of course in your sink connectors) more than 1 in your connector to work more concurrently and faster.
Please increase the number of Kafka-Connect topics and use Kafka-Connect distributed mode to increase the High Availability of your Kafka-connect cluster. you can make it by setting the number of replication factor in the Kafka-Connectand Schema-Registry config for example:
config.storage.replication.factor=2
status.storage.replication.factor=2
offset.storage.replication.factor=2
Please set the topic compression to snappy for your large tables. it will increase the throughput of the topics and this helps the Debezium connector to work faster and also do not use JSON Converter it's recommended to use Avro Convertor!
Also please use load-balancer for your Schema Registry
For testing the cluster you can create a connector with only one table (I mean a large table!) with the database.whitelist and set snapshot.mode to initial
And About the schema-registry! Schema-registry user both Kafka and Zookeeper with setting these configs:
bootstrap.servers
kafkastore.connection.url
And this is the reason of your downtime of the shema-registry cluster

Consume from Kafka 0.10.x topic using Storm 0.10.x (KafkaSpout)

I am not sure if this a right question to ask in this forum. We were consuming from a Kafka topic by Storm using the Storm KafkaSpout connector. It was working fine till now. Now we are supposed to connect to a new Kafka cluster having upgraded version 0.10.x from the same Storm env which is running on version 0.10.x.
From storm documentation (http://storm.apache.org/releases/1.1.0/storm-kafka-client.html) I can see that storm 1.1.0 is compatible with Kafka 0.10.x onwards supporting the new Kafka consumer API. But in that case I won't be able to run the topology in my end (please correct me if I am wrong).
Is there any work around for this?
I have seen that even if the New Kafka Consumer API has removed ZooKeeper dependency but we can still consume message from it using the old Kafka-console-consumer.sh by passing the --zookeeper flag instead of new –bootstrap-server flag (recommended). I run this command from using Kafka 0.9 and able to consume from a topic hosted on Kafka 0.10.x
When we are trying to connect getting the below exception:
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /brokers/topics/mytopic/partitions
at storm.kafka.DynamicBrokersReader.getBrokerInfo(DynamicBrokersReader.java:81) ~[stormjar.jar:?]
at storm.kafka.trident.ZkBrokerReader.<init>(ZkBrokerReader.java:42) ~[stormjar.jar:?]
But we are able to connect to the remote ZK server and validated that the path exists:
./zkCli.sh -server remoteZKServer:2181
[zk: remoteZKServer:2181(CONNECTED) 5] ls /brokers/topics/mytopic/partitions
[3, 2, 1, 0]
As we can see above that it's giving us expected output as the topic has 4 partitions in it.
At this point have the below questions:
1) Is it at all possible to connect to Kafka 0.10.x using Storm version 0.10.x ? Has one tried this ?
2) Even if we are able to consume, do we need to make any code change in order to retrieve the message offset in case of topology shutdown/restart. I am asking this as we will passing the Zk cluster details instead of the brokers info as supported in old KafkaSpout version.
Running out of options here, any pointers would be highly appreciated
UPDATE:
We are able to connect and consume from the remote Kafka topic while running it locally using eclipse. To make sure storm does not uses the in-memory zk we have used the overloaded constructor LocalCluster("zkServer",port), it's working fine and we can see the data coming. This lead us to conclude that version compatibility might not be the issue here.
However still no luck when deployed the topology in cluster.
We have verified the connectivity from storm box to zkservers
The znode seems fine also ..
At this point really need some pointers here, what could possibly be wrong with this and how do we debug that? Never worked with Kafka 0.10x before so not sure what exactly are we missing.
Really appreciate some help and suggestions
Storm 0.10x is compatible with Kafka 0.10x . We can still uses the old KafkaSpout that depends on zookeeper based offset storage mechanism.
The connection loss exception was coming as we were trying to reach a remote Kafka cluster that does not allow/accept connection from our end. We need to open specific firewall port so that the connection can be established. It seems that while running topology is cluster mode all the supervisor nodes should be able to talk to the zookeeper, so the firewall should be open for each one of them.