We have a ZooKeeper cluster with 3 nodes.
When we run the following commands:
echo stat | nc zookeeper_server01 2181 | grep Mode
echo stat | nc zookeeper_server02 2181 | grep Mode
echo stat | nc zookeeper_server03 2181 | grep Mode
we saw that zookeeper_server03 was the leader and the others reported Mode: follower.
However, we noticed that the state changes every couple of minutes: after 4 minutes zookeeper_server01 became the leader and the others reported Mode: follower,
and after another 6 minutes zookeeper_server02 became the leader, and so on.
My question is: is this strange behavior normal?
I should mention that our production Kafka cluster uses these ZooKeeper servers, so we are worried about this.
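To pin down exactly when the role flips, the same check can be scripted: the sketch below (in Java, purely illustrative) polls the stat four-letter command on each server and prints the Mode line with a timestamp. The host names are copied from the commands above; the 30-second polling interval is an arbitrary choice.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;

public class ZkModeWatcher {
    // Same servers as in the nc commands above.
    private static final String[] HOSTS = {
            "zookeeper_server01", "zookeeper_server02", "zookeeper_server03"};

    public static void main(String[] args) throws Exception {
        while (true) {
            for (String host : HOSTS) {
                System.out.printf("%s %s %s%n", LocalDateTime.now(), host, mode(host));
            }
            Thread.sleep(30_000); // poll every 30 seconds
        }
    }

    // Sends the "stat" four-letter command (the same thing `echo stat | nc` does)
    // and returns the "Mode: ..." line of the response.
    private static String mode(String host) {
        try (Socket socket = new Socket(host, 2181)) {
            socket.setSoTimeout(5_000);
            OutputStream out = socket.getOutputStream();
            out.write("stat".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("Mode:")) {
                    return line;
                }
            }
            return "no Mode line in response";
        } catch (Exception e) {
            return "unreachable: " + e.getMessage();
        }
    }
}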
We have two logical replication slots in our PostgreSQL (version 11) database instance, and we are using pgJDBC to stream data from both of them.
We make sure to regularly send feedback and update the confirmed_flush_lsn (every 10 minutes) for both slots to the same position. However,
from our data we have seen that the restart_lsn movement of the two slots is not in sync, and most of the time one of them lags far enough behind
to hold on to WAL files unnecessarily.
Here are some data points that illustrate the problem:
Thu Dec 10 05:37:13 CET 2020
slot_name | restart_lsn | confirmed_flush_lsn
--------------------------------------+---------------+---------------------
db_dsn_metadata_src_private | 48FB/F3000208 | 48FB/F3000208
db_dsn_metadata_src_shared | 48FB/F3000208 | 48FB/F3000208
(2 rows)
Thu Dec 10 13:53:46 CET 2020
slot_name | restart_lsn | confirmed_flush_lsn
-------------------------------------+---------------+---------------------
db_dsn_metadata_src_private | 48FC/2309B150 | 48FC/233AA1D0
db_dsn_metadata_src_shared | 48FC/233AA1D0 | 48FC/233AA1D0
(2 rows)
Thu Dec 10 17:13:51 CET 2020
slot_name | restart_lsn | confirmed_flush_lsn
-------------------------------------+---------------+---------------------
db_dsn_metadata_src_private | 4900/B4C3AE8 | 4900/94FDF908
db_dsn_metadata_src_shared | 48FD/D2F66F10 | 4900/94FDF908
(2 rows)
Though we call setFlushedLSN() and forceUpdateStatus() regularly on both slots' streams, the slot named private is far behind its confirmed_flush_lsn, and the
slot named shared is also behind its confirmed_flush_lsn, though not as far. Since the restart_lsn is not moving fast enough, this causes a lot of issues with WAL
file management and prevents old WAL files from being deleted to free up disk space.
How can this problem be solved? Are there any general guidelines for overcoming this issue?
We have seen another thread with a similar question, but there was no response there either:
WALs getting pilled up - restart_lsn of logical replication not moving in PostgreSQL
I am using the sample program published by pgJDBC here:
https://jdbc.postgresql.org/documentation/head/replication.html
to stream changes from PostgreSQL.
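For reference, the feedback part of that sample looks roughly like the sketch below for one of the two slots: open a logical replication stream and report the consumed position back as both applied and flushed. This is only a sketch; the connection URL, user, password, and status interval are placeholders, not our actual configuration.

import java.nio.ByteBuffer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.postgresql.PGConnection;
import org.postgresql.PGProperty;
import org.postgresql.replication.LogSequenceNumber;
import org.postgresql.replication.PGReplicationStream;

public class SlotFeedbackSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        PGProperty.USER.set(props, "replication_user");   // placeholder
        PGProperty.PASSWORD.set(props, "secret");          // placeholder
        PGProperty.ASSUME_MIN_SERVER_VERSION.set(props, "11");
        PGProperty.REPLICATION.set(props, "database");
        PGProperty.PREFER_QUERY_MODE.set(props, "simple");

        Connection con =
                DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", props);
        PGReplicationStream stream = con.unwrap(PGConnection.class)
                .getReplicationAPI()
                .replicationStream()
                .logical()
                .withSlotName("db_dsn_metadata_src_private")
                .withStatusInterval(20, TimeUnit.SECONDS)
                .start();

        while (true) {
            ByteBuffer msg = stream.readPending();
            if (msg == null) {
                TimeUnit.MILLISECONDS.sleep(10L);
                continue;
            }
            // ... decode and process the change here ...

            // Report progress back to the server; the server uses this feedback
            // (together with any transactions it is still decoding) to decide
            // how far restart_lsn and confirmed_flush_lsn may advance.
            LogSequenceNumber lsn = stream.getLastReceiveLSN();
            stream.setAppliedLSN(lsn);
            stream.setFlushedLSN(lsn);
            stream.forceUpdateStatus();
        }
    }
}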
I am using Airflow 1.10.9 with Celery workers. I have DAGs that run whenever a task comes in; each run spins up a new EC2 instance, which connects to RDS based on some logic. However, the EC2 instance holds its database connection even when no task is running, and it keeps holding it until auto scaling scales the instance down.
RDS details:
Class: db.t3.xlarge
Engine: PostgreSQL
I have checked the RDS logs, but no luck:
LOG: could not receive data from client: Connection reset by peer
Here is a breakdown of the RDS connections:
state | wait_event | wait_event_type | count
--------+---------------------+-----------------+-------
| AutoVacuumMain | Activity | 1
| BgWriterHibernate | Activity | 1
| CheckpointerMain | Activity | 1
idle | ClientRead | Client | 525
| LogicalLauncherMain | Activity | 1
| WalWriterMain | Activity | 1
active | | | 1
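For context, a breakdown like the table above can be obtained by grouping the rows of pg_stat_activity; the sketch below is one way to run such a query over JDBC. The endpoint, database name, and credentials are placeholders, and this is only an assumption about how the counts were gathered.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConnectionBreakdown {
    public static void main(String[] args) throws Exception {
        // Placeholder RDS endpoint, database, and credentials.
        String url = "jdbc:postgresql://my-rds-endpoint:5432/airflow_db";
        String sql = "SELECT coalesce(state, '') AS state, "
                   + "       coalesce(wait_event, '') AS wait_event, "
                   + "       coalesce(wait_event_type, '') AS wait_event_type, "
                   + "       count(*) AS count "
                   + "FROM pg_stat_activity "
                   + "GROUP BY 1, 2, 3 "
                   + "ORDER BY count DESC";
        try (Connection con = DriverManager.getConnection(url, "airflow", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-8s %-20s %-16s %d%n",
                        rs.getString("state"),
                        rs.getString("wait_event"),
                        rs.getString("wait_event_type"),
                        rs.getLong("count"));
            }
        }
    }
}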
All of these connections are coming from the Celery workers.
Any help is appreciated.
Following the solution mentioned here: kafka-mirror-maker-failing-to-replicate-consumer-offset-topic, I was able to start MirrorMaker without any errors across the DC1 (live Kafka cluster) and DC2 (backup Kafka cluster) clusters.
It also looks like it is able to sync the __consumer_offsets topic from the DC1 cluster to the DC2 cluster.
Issue
If I shut down the consumer for DC1 and point the same consumer (same group_id) at DC2, it reads the same messages again, even though MirrorMaker is able to sync offsets for this topic and its partitions.
I can see that LOG-END-OFFSET is shown correctly, but CURRENT-OFFSET still points to the old position, causing LAG.
Example
Mirror Maker is still running in DC2.
Before the consumer is shut down in DC1:
//DC1 __consumer_offsets topic
+-----------------------------------------------------------------+
| TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG |
+-----------------------------------------------------------------+
| gs.suraj.test.1 0 10626 10626 0 |
| gs.suraj.test.1 2 10619 10619 0 |
| gs.suraj.test.1 1 10598 10598 0 |
+-----------------------------------------------------------------+
Stop the consumer in DC1.
Before the consumer is started in DC2:
//DC2 __consumer_offsets topic
+-----------------------------------------------------------------+
| TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG |
+-----------------------------------------------------------------+
| gs.suraj.test.1 0 9098 10614 1516 |
| gs.suraj.test.1 2 9098 10614 1516 |
| gs.suraj.test.1 1 9098 10615 1517 |
+-----------------------------------------------------------------+
Because of this lag, when I start the same consumer in DC2 it reads 4549 messages again (the sum of the three partition lags: 1516 + 1516 + 1517), which should not happen, as those messages were already read and committed in DC1 and MirrorMaker has synced the __consumer_offsets topic from DC1 to DC2.
Please let me know if I am missing anything here.
If you are using MirrorMaker 2.0, the KIP states explicitly in its Motivation section that there is no support for exactly-once semantics:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0#KIP-382:MirrorMaker2.0-Motivation
But they intend to support it in the future.
Kafka topic creation is failing in the scenario below:
Nodes in the Kafka cluster: 4
Replication factor: 4
Number of nodes up and running in the cluster: 3
Below is the error:
./kafka-topics.sh --zookeeper :2181 --create --topic test_1 --partitions 1 --replication-factor 4
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Error while executing topic command : Replication factor: 4 larger than available brokers: 3.
[2018-10-31 11:58:13,084] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 4 larger than available brokers: 3.
Is this valid behavior, or a known issue in Kafka?
If all the nodes in a cluster must always be up and running, then what about failure tolerance?
Updating the JSON file to increase the replication factor for the already created topic:
$cat /tmp/increase-replication-factor.json
{"version":1,
"partitions":[
{"topic":"vHost_v81drv4","partition":0,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":1,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":2,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":3,"replicas":[4,1,2,3]}
{"topic":"vHost_v81drv4","partition":4,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":5,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":6,"replicas":[4,1,2,3]},
{"topic":"vHost_v81drv4","partition":7,"replicas":[4,1,2,3]}
]}
When a new topic is created in Kafka, it is replicated N=replication-factor times across your brokers. Since you have 3 brokers up and running and replication-factor set to 4, the topic cannot be replicated 4 times and thus you get an error.
When creating a new topic, you either need to ensure that all 4 of your brokers are up and running, or set the replication factor to a smaller value, in order to avoid failure on topic creation when one of your brokers is down.
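As a programmatic illustration of the same constraint, here is a rough sketch using Kafka's AdminClient (the bootstrap address is a placeholder; the topic name is taken from the question): with 3 live brokers, creating the topic with replication factor 3 succeeds, while replication factor 4 fails with the InvalidReplicationFactorException shown above.

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 3: fits on the 3 live brokers.
            NewTopic topic = new NewTopic("test_1", 1, (short) 3);
            try {
                admin.createTopics(Collections.singleton(topic)).all().get();
                System.out.println("topic created");
            } catch (ExecutionException e) {
                // With replication factor 4 and only 3 live brokers, the request
                // fails here and the cause is InvalidReplicationFactorException.
                System.err.println("creation failed: " + e.getCause());
            }
        }
    }
}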
In case you want to create a topic with replication factor 4 while one broker is down, you can initially create the topic with replication-factor=3 and, once your 4th broker is up and running, modify the configuration of that topic and increase its replication factor by following the steps below (assuming you have a topic example with 4 partitions):
Create an increase-replication-factor.json file with this content:
{"version":1,
"partitions":[
{"topic":"example","partition":0,"replicas":[0,1,2,3]},
{"topic":"example","partition":1,"replicas":[0,1,2,3]},
{"topic":"example","partition":2,"replicas":[0,1,2,3]},
{"topic":"example","partition":3,"replicas":[0,1,2,3]}
]}
Then execute the following command:
kafka-reassign-partitions --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute
And finally you'd be able to confirm that your topic is replicated across the 4 brokers:
kafka-topics --zookeeper localhost:2181 --topic example --describe
Topic:example PartitionCount:4 ReplicationFactor:4 Configs:retention.ms=1000000000
Topic: example Partition: 0 Leader: 2 Replicas: 0,1,2,3 Isr: 2,0,1,3
Topic: example Partition: 1 Leader: 2 Replicas: 0,1,2,3 Isr: 2,0,1,3
Topic: example Partition: 2 Leader: 2 Replicas: 0,1,2,3 Isr: 2,0,1,3
Topic: example Partition: 3 Leader: 2 Replicas: 0,1,2,3 Isr: 2,0,1,3
Regarding high availability, let me explain how Kafka works:
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental id, known as an offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
Now, a Kafka cluster is composed of multiple brokers. Each broker is identified by an ID and can contain certain topic partitions.
Example of 2 topics (with 3 and 2 partitions respectively):
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 2 |
| Partition 1 |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 2 |
| |
| |
| Topic 2 |
| Partition 0 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can serve the data of the topic. For instance, assume that we have a topic with 2 partitions and a replication-factor set to 2, as shown below:
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| |
| |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 1 |
| Partition 1 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication-factor of 3 is always a good idea, since it allows one broker to be taken down for maintenance purposes and another one to fail unexpectedly. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
Note about Leaders:
At any time, only one broker can be the leader of a partition, and only that leader can receive and serve data for that partition. The remaining brokers just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when a broker fails. In general, when all replicas of a partition fail or go offline, the leader is automatically set to -1.
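If you want to check programmatically which broker currently leads each partition and which replicas are in sync, a small AdminClient sketch like the following can be used (the bootstrap address is a placeholder; the topic name matches the example above):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("example"))
                    .all().get().get("example");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() is the broker currently serving the partition;
                // isr() lists the replicas that are in sync with it.
                System.out.printf("partition=%d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}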
This is valid behavior: when creating a new topic with replication factor 4, all 4 nodes need to be up and running.
Confluence Replica placements - Initial placement
Topic creation makes its placement decisions based only on the brokers that are currently live (for the manual create-topic command);
not all nodes need to be up and running in order to use the topic after it has been created.
Apache Kafka documentation about the replication factor:
The replication factor controls how many servers will replicate each message that is written. If you have a replication factor of 3 then up to 2 servers can fail before you will lose access to your data. We recommend you use a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.