Kafka with JBOD disks: what is the maximum number of disks that we can set on a Kafka machine? - apache-kafka

We are planning to build 17 Kafka machines.
Since we need a huge amount of storage, we are thinking of using JBOD disks for each Kafka machine.
So the plan is like this:
number of Kafka machines - 17
Kafka version - 2.7
number of disks in the JBOD - 44 disks (each disk is 2.4 TB)
Just to give some more perspective from the Kafka configuration side: in the server.properties file we need to set log.dirs with all 44 disks (see the sketch below).
Based on that, we are wondering whether such a large number of disks - 44 - is above some threshold.
We actually searched a lot for a useful post that discusses this, but without success.
So, to summarize:
What is the limit on the number of disks (JBOD disks) that we can connect to a Kafka machine?
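For reference, a JBOD layout is expressed in Kafka purely through the log.dirs broker property; a minimal server.properties sketch, with hypothetical mount paths and only the first few of the 44 entries written out:

# server.properties (sketch - mount paths are illustrative)
# one entry per physical disk, comma-separated; the real file would list all 44 mounts
log.dirs=/data/disk01/kafka-logs,/data/disk02/kafka-logs,/data/disk03/kafka-logs,/data/disk04/kafka-logs

Kafka spreads new partitions across the listed directories, with each partition's data living entirely on one of those disks.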

Related

High CPU utilization in KSQL db

We are running KSQL db server on kubernetes cluster.
Node config:
AWS EKS fargate
No of nodes: 1
CPU: 2 vCPU (Request), 4 vCPU (Limit)
RAM: 4 GB (Request), 8 GB (Limit)
Java heap: 3 GB (Default)
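For context, that resource allocation corresponds roughly to a pod spec like the following (names and the KSQL_HEAP_OPTS value are illustrative, simply restating the numbers above):

containers:
  - name: ksqldb-server
    image: confluentinc/ksqldb-server:0.23.1
    env:
      - name: KSQL_HEAP_OPTS        # JVM heap, kept below the container memory limit
        value: "-Xms3g -Xmx3g"
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi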
Data size:
We have ~11 source topics with 1 partition each; some of them have around 10k records, and a few have more than 100k records. There are ~7 sink topics, but to create those 7 sink topics we have ~60 KSQL tables, ~38 KSQL streams, and ~64 persistent queries because of joins and aggregations. So the computation is heavy.
ksqlDB version: 0.23.1, using the official Confluent KSQL Docker image
The problem:
When running our KSQL script we see the CPU spike to 350-360% and memory to 20-30%. When that happens, Kubernetes restarts the server instance, which causes the ksql-migration to fail.
Error:
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection
refused:
<deployment-name>.<namespace>.svc.cluster.local/172.20.73.150:8088
Error: io.vertx.core.VertxException: Connection was closed
We have 30 migration files, and each file creates multiple tables and streams.
It always fails on v27.
What we have tried so far:
Running it alone - in that case it passes with no error.
Increasing the initial CPU to 4 vCPU, but there was no change in CPU utilization.
Using 2 nodes with 2 partitions in Kafka, but that had the same issue, with the addition that a few data columns ended up with no data.
So something is not right in our configuration or resource allocation.
What is the standard way to deploy KSQL in Kubernetes? Maybe it is not meant for Kubernetes.

SSD or HDD for Kafka Brokers? (Using SSD for Kafka)

Kafka is fast because it uses sequential writing techniques on HDD.
If I use SSD for Kafka Brokers, do I get faster performance?
As far as I know, SSDs work differently than HDDs, and I think with an SSD I wouldn't get the benefit of sequential writes. I'm worried that using SSDs wouldn't be good for Kafka brokers.
My questions :
Is SSD better than HDD for Kafka Brokers?
Does the "sequential write technique" also apply to SSD?
SSDs are best for ZooKeeper servers, not the brokers.
If I use SSD for Kafka Brokers, do I get faster performance?
Honestly, that is up to you to benchmark for your use cases (a sample benchmark command is sketched at the end of this answer).
However, Kafka does sequential scans/writes, not the random access that SSDs are optimized for, so spinning disks are preferred, regardless of the claimed speeds of SSDs.
https://docs.confluent.io/current/kafka/deployment.html#disks
Also, disk pools (JBOD) and partitioning schemes are important, and ZFS seems to get good gains over XFS or ext4.
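A minimal sketch of such a benchmark using Kafka's bundled producer perf tool (topic name, record counts, and bootstrap address are placeholders):

kafka-producer-perf-test.sh \
  --topic disk-bench \
  --num-records 10000000 \
  --record-size 1000 \
  --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092 acks=1

Run the same command against a broker backed by each disk type and compare the reported MB/s and latency percentiles.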
I have run Kafka in production for 8 years with 1 million messages per second.
Spinning disks will only work as long as you can avoid lagging consumers. If you have too many, disk access starts to look like random I/O and spinning-disk-based Kafka clusters will fail (tested that for you).
Do not put Kafka on consumer-grade drives; we tried, and they die hard after about 1 year. Enterprise NVMe is awesome if you can afford it. We are currently experimenting with a 22-disk SSD RAID 0. 10 Gbit+ NICs are a must.

Log Flush Policy for Kafka brokers in production systems

What are the appropriate values for the following Kafka broker properties for a production system?
log.flush.interval.messages
log.flush.interval.ms
I am seeing too many small IO write requests on my st1 HDD drive, which I would like to optimize. Will changing these properties help? What are the tradeoffs?
Also, how are these properties typically configured in production systems?
We have 5 Kafka brokers (r5.xlarge) with attached st1 HDD Drive. Our usual data input rate is around 2 Mbps during peak time and 700-800 Kbps during usual time.
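For reference, both are broker-level settings in server.properties. By default Kafka leaves them effectively unset and relies on the OS page cache (plus replication) for durability, so flushing is driven by the operating system rather than explicit fsyncs. A sketch with purely illustrative values, not a recommendation:

# server.properties - force a flush after 50,000 messages per partition,
# or once a message has sat unflushed for 60 seconds, whichever comes first
log.flush.interval.messages=50000
log.flush.interval.ms=60000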

Kafka Producer and Broker Throughput Limitations

I have configured a two node six partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on a m4.2xlarge EC2 instance backed by an EBS.
I understand that rate of data flow from Kafka producer to Kafka broker is limited by the network bandwidth of producer.
Say network bandwidth between Kafka producer and broker is 1Gbps ( approx. 125 MB/s) and bandwidth between Kafka broker and storage ( between EC2 instance and EBS volume ) is 1 Gbps.
I used the org.apache.kafka.tools.ProducerPerformance tool for profiling the performance.
I observed that a single producer can write at around 90 MB/s to the broker when the message size is 100 bytes (hence the network is not saturated).
I also observed that disk write rate to EBS volume is around 120 MB/s.
Is this 90 MB/s due to some network bottleneck, or is it a limitation of Kafka? (forgetting batch size, compression, etc. for simplicity)
Could this be due to the bandwidth limitation between broker and ebs volume?
I also observed that when two producers ( from two separate machines ) produce data, throughput of one producer dropped to around 60 MB/s.
What could be the reason for this? Why doesn't that value reach 90 MB/s ? Could this be due to the network bottleneck between broker and ebs volume?
What confuses me is that in both cases (single producer and two producers ) disk write rate to ebs stays around 120 MB/s ( closer to its upper limit ).
Thank you
I ran into the same issue. As per my understanding, in the first case one producer is sending data to two brokers (there is nothing else on the network), so you get 90 MB/s, with each broker receiving roughly 45 MB/s. In the second case two producers are sending data to two brokers, so from each producer's perspective it can only send at 60 MB/s, but from each broker's perspective it is now receiving data at 60 MB/s. So in total you are actually able to push more data through Kafka.
There are a couple of things to consider:
There are separate disk and network limits that apply to both the instance and the volume.
You have to account for replication. If you have RF=2, the write traffic taken by a single broker is roughly RF * (TOTAL_PRODUCER_TRAFFIC) / (BROKER_COUNT), assuming leaders and replicas are evenly distributed across the brokers.
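Plugging the question's numbers into that (a rough sketch, assuming writes are spread evenly):

per-broker disk writes ≈ RF * total producer throughput / broker count
one producer:   2 * 90 MB/s / 2  =  90 MB/s per broker  (~45 MB/s leader share + ~45 MB/s replicated from the other broker)
two producers:  2 * 120 MB/s / 2 = 120 MB/s per broker

The two-producer case lands right at the ~120 MB/s the EBS volume was observed to sustain, which suggests the volume's write throughput, not the network, is what caps the second scenario.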

Kafka recommended system configuration

I'm expecting our inflow into Kafka to rise to around 2 TB/day over time. I'm planning to set up a Kafka cluster with 2 brokers (each running on a separate system). What is the recommended hardware configuration for handling 2 TB/day?
As a baseline you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know the number of messages you get per second/hour, because this will determine the size of your cluster. For disks, it is not strictly necessary to get SSDs, because the system will buffer the data in RAM (the page cache) first. Still, you need reasonably fast hard disks to ensure that flushing the queue to disk does not slow your system down.
I would also recommend using 3 Kafka brokers and 3 or 4 ZooKeeper servers.
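To put 2 TB/day in perspective, a back-of-the-envelope sketch (the replication factor and retention below are assumptions, not figures from the question):

average ingest:      2 TB/day ≈ 2,000,000 MB / 86,400 s ≈ 23 MB/s
with RF=3:           ≈ 70 MB/s of write traffic across the whole cluster
7 days retention:    ≈ 14 TB of raw data, ≈ 42 TB of disk with RF=3

Peak rates are usually several times the average, so size the brokers and disks for the peaks rather than the average.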