We are running an RDS Postgres cluster (v11.16) with a primary and 4 replicas. Replica R1 has a noticeably higher replication lag than the others: where R2-R4 average about 15 ms, R1 sits around 400 ms. All replicas are in the same region as the primary and all use the same instance size. Amazon Route 53 DNS routes requests to the replicas round-robin, so all replicas receive equal traffic. statement_timeout is set to 1 min across all replicas. What else should I be looking at?
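Beyond the server-side causes, it can help to flag the outlier programmatically rather than eyeballing dashboards. A minimal sketch of such a check, using the lag numbers from the question; the 5x-median threshold is an arbitrary assumption, not a Postgres or RDS convention:

```python
from statistics import median

def lagging_replicas(lag_ms: dict[str, float], factor: float = 5.0) -> list[str]:
    """Return replicas whose lag exceeds `factor` times the fleet median."""
    med = median(lag_ms.values())
    return [name for name, lag in lag_ms.items() if lag > factor * med]

# Numbers from the question: R1 at ~400 ms, R2-R4 around 15 ms.
print(lagging_replicas({"R1": 400, "R2": 15, "R3": 14, "R4": 16}))  # -> ['R1']
```

The same comparison could be fed from CloudWatch's ReplicaLag metric or from `pg_stat_replication` on the primary.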
Related
We are running a ksqlDB server on a Kubernetes cluster.
Node config:
AWS EKS fargate
No of nodes: 1
CPU: 2 vCPU (Request), 4 vCPU (Limit)
RAM: 4 GB (Request), 8 GB (Limit)
Java heap: 3 GB (Default)
Data size:
We have ~11 source topics with 1 partition each; some of them have ~10k records and a few have more than 100k. There are ~7 sink topics, but producing those 7 sink topics requires ~60 ksqlDB tables, ~38 ksqlDB streams, and ~64 persistent queries because of joins and aggregations, so the computation is heavy.
ksqlDB version: 0.23.1, using the official Confluent ksqlDB Docker image.
The problem:
When running our KSQL script we see CPU spike to 350-360% while memory stays at 20-30%. When that happens, Kubernetes restarts the server instance, which causes the ksql-migrations run to fail.
Error:
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
refused:
<deployment-name>.<namespace>.svc.cluster.local/172.20.73.150:8088
Error: io.vertx.core.VertxException: Connection was closed
We have 30 migration files, and each file creates multiple tables and streams.
It always fails on v27.
What we have tried so far:
Running v27 alone; in that case it passes with no errors.
Increasing the initial CPU request to 4 vCPU, but CPU utilization did not change.
Running 2 nodes with 2 Kafka partitions, but that had the same issue, and in addition a few data columns ended up with no data.
So something is not right in our configuration or resource allocation.
What is the standard way to deploy ksqlDB on Kubernetes? Or maybe it is not meant for Kubernetes?
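One plausible mechanism worth checking, given that memory is only at 20-30%: the observed CPU peak is very close to the pod's limit, so short bursts can hit the CFS quota and get throttled, and a throttled process can miss liveness-probe deadlines, triggering restarts. A back-of-envelope sketch using the numbers from the question; the 80% "comfortable utilisation" factor is an arbitrary assumption, not a Kubernetes rule:

```python
# Rough CPU-headroom check for the pod described above.
CPU_LIMIT_VCPU = 4.0          # pod CPU limit from the question
OBSERVED_PEAK_PERCENT = 360   # 360% == 3.6 vCPU

peak_vcpu = OBSERVED_PEAK_PERCENT / 100
headroom = CPU_LIMIT_VCPU - peak_vcpu
print(f"Peak usage: {peak_vcpu} vCPU, headroom: {headroom:.1f} vCPU")

# Sustained usage this close to the limit means bursts will be
# throttled by the CFS quota; if liveness-probe responses are delayed
# past their timeout while throttled, Kubernetes restarts the pod.
print("Running above 80% of limit:", peak_vcpu > 0.8 * CPU_LIMIT_VCPU)
```

If this is the mechanism, raising the limit, relaxing probe timeouts during migrations, or running the migrations sequentially (which already passes, per "Running v27 alone") are the levers to try.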
I am setting up a Confluent Kafka cluster (Community) with 3 ZooKeeper and 5 Kafka broker nodes.
The requirement is that we should be able to continue in live environment even if 2 broker nodes are down.
What should be the recommended
-replication factor,
-in-sync replicas (min.insync.replicas)
for topics with 50 partitions?
In most cases the suggested replication factor is 3. What would be the impact if we increase that to 5 in the mentioned cluster configuration?
Setting the replication factor to 5 would mean that every partition has a replica on every broker in the cluster. If two brokers are down, the replication-factor requirement is no longer met and your topics will be under-replicated (the cluster should report a warning).
min.insync.replicas should then be set to 3 (or less), otherwise producing a message with acks=all would fail while two brokers are down. Producing with acks=0 or acks=1 still works regardless of min.insync.replicas, since that setting is only enforced for acks=all (the only valid acks values are 0, 1, and all/-1).
Also note that while two nodes are down, you can't create new topics with a replication factor of 5 (also see KIP-409).
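The acks behaviour above can be sketched as a small availability check. The numbers are the question's (RF=5 on a 5-broker cluster, so every broker hosts a replica of each partition); the function itself is an illustration, not Kafka code:

```python
def produce_allowed(replication_factor: int,
                    brokers_down: int,
                    min_insync_replicas: int,
                    acks: str) -> bool:
    """Can a produce request succeed, assuming every broker hosts one
    replica of the partition (true here: RF=5 on a 5-broker cluster)?"""
    alive_replicas = replication_factor - brokers_down
    if acks == "all":
        # acks=all requires at least min.insync.replicas replicas in sync.
        return alive_replicas >= min_insync_replicas
    # acks=0 and acks=1 ignore min.insync.replicas; only a leader is needed.
    return alive_replicas >= 1

# RF=5, two brokers down:
print(produce_allowed(5, 2, 3, "all"))  # True: 3 alive >= 3
print(produce_allowed(5, 2, 4, "all"))  # False: only 3 alive
print(produce_allowed(5, 2, 4, "1"))    # True: min.insync not enforced
```

This is why min.insync.replicas=3 is the highest value that keeps acks=all producers working through a two-broker outage.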
I have configured a two node six partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on a m4.2xlarge EC2 instance backed by an EBS.
I understand that rate of data flow from Kafka producer to Kafka broker is limited by the network bandwidth of producer.
Say the network bandwidth between the Kafka producer and broker is 1 Gbps (approx. 125 MB/s), and the bandwidth between the broker and storage (between the EC2 instance and the EBS volume) is also 1 Gbps.
I used the org.apache.kafka.tools.ProducerPerformance tool for profiling the performance.
I observed that a single producer can write at around 90 MB/s to the broker with a message size of 100 bytes (hence the network is not saturated).
I also observed that disk write rate to EBS volume is around 120 MB/s.
Is this 90 MB/s due to some network bottleneck, or is it a limitation of Kafka? (Forgetting batch size, compression, etc. for simplicity.)
Could this be due to the bandwidth limitation between the broker and the EBS volume?
I also observed that when two producers (from two separate machines) produce data, the throughput of each producer dropped to around 60 MB/s.
What could be the reason for this? Why doesn't that value reach 90 MB/s? Could this be due to the network bottleneck between the broker and the EBS volume?
What confuses me is that in both cases (single producer and two producers) the disk write rate to EBS stays around 120 MB/s (close to its upper limit).
Thank you
I ran into the same issue. As I understand it, in the first case one producer is sending data to two brokers (there is nothing else on the network), so you got 90 MB/s, with each broker receiving about 45 MB/s. In the second case, two producers are sending data to two brokers: each producer can only send at 60 MB/s, but the brokers together are now receiving 120 MB/s, so you are actually pushing more data through Kafka than before.
There are a couple things to consider:
There are separate disk and network limits that apply to both the instance and the volume.
You have to account for replication. With RF=2, every message is written twice cluster-wide, so the amount of write traffic taken by a single node is 2*(PRODUCER_TRAFFIC)/(BROKER_COUNT), assuming even distribution of writes across partitions and brokers.
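Plugging the question's numbers in (RF=2, two brokers, even distribution of writes) shows why the disk write rate sits near 120 MB/s in both scenarios; this is arithmetic on the reported figures, not a measurement:

```python
def per_broker_write_mbps(producer_mbps: float,
                          replication_factor: int,
                          broker_count: int) -> float:
    """Cluster-wide bytes written = producer traffic * RF (each message
    is persisted once per replica); with even distribution, each broker
    takes an equal share of that total."""
    return producer_mbps * replication_factor / broker_count

# Single producer at 90 MB/s, RF=2, 2 brokers:
print(per_broker_write_mbps(90, 2, 2))   # 90.0 MB/s of disk writes per broker

# Two producers at 60 MB/s each (120 MB/s aggregate):
print(per_broker_write_mbps(120, 2, 2))  # 120.0 MB/s, near the EBS ceiling
```

So in the two-producer case each broker's disk is already absorbing roughly the full EBS throughput, which is consistent with producers being pushed back to 60 MB/s each.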
I am setting up a kafka+zookeeper cluster. Let's say I want 3 kafka brokers. I am wondering if I can setup 3 machines with kafka on them and then run the zookeeper cluster on the same nodes. So each machine has a kafka+zookeeper node in the cluster, instead of having 3 machines for kafka and 3 machines for zookeeper (6 in total).
What are the advantages and disadvantages? These machines will most probably be dedicated to running kafka/zookeeper. I am thinking if I can reduce costs a bit without sacrificing performance.
We have been running zookeeper and kafka broker on the same node in production environment for years without any problems. The cluster is running at very very high qps and IO traffics, so I dare say that our experience suits most scenarios.
The advantage is quite simple: saving machines. Kafka brokers are IO-intensive, while ZooKeeper nodes don't cost much disk IO or CPU, so they won't disturb each other on most occasions.
But do remember to keep watching your CPU and IO usage (not only disk but also network), and increase cluster capacity before they reach a bottleneck.
I don't see any disadvantages because we have very good cluster capacity planning.
It makes sense to collocate them when Kafka cluster is small, 3-5 nodes. But keep in mind that it is a colocation of two applications that are sensitive to disk I/O. The workloads and how chatty they are with local Zk's also plays an important role here, especially from page cache memory usage perspective.
Once the Kafka cluster grows to a dozen or more nodes, collocating a ZooKeeper node on each broker creates quorum overhead (slower writes, more nodes in quorum checks), so a separate ZooKeeper cluster has to be in place.
Overall, if Kafka cluster usage is low from the start and you want to save some costs, it is reasonable to start collocated, but have a migration strategy for setting up a separate ZooKeeper cluster so you are not caught off guard once the Kafka cluster has to be scaled horizontally.
My understanding is that Kafka replication requires a quorum setup of 2*[replication factor] + 1 servers.
I however managed to create a topic with replication factor 2 on 2 servers, and it does seem to work.
Why is this possible at all? Does it have any side effects to use only 2 servers?