Caveats of scaling a Kafka cluster vertically vs horizontally?

We are planning to build a multi TB Kafka Cluster.
From presentations by LinkedIn, which is supposed to run the largest Kafka cluster in the world, it seems like they are using a few pretty large servers.
We are planning to go the other way: Launch a lot of small Kafka brokers handling a few GB each.
What are the pros and cons of scaling vertically vs horizontally with Kafka? For example, for 50TB: 5 brokers handling 10TB each, or 5000 brokers handling 10GB each.
Those numbers are made up.
PS: For us, maintaining 5 or 5000 servers has the same operational cost, as it's all automated.

My recommendation would be to go with 5 brokers with 10TB each, with 3 redundant copies of the data (RF3). Kafka brokers generate a lot of crosstalk/chatter between them, so it's best to minimize the network overhead, as well as the operational and even cognitive overhead when there are issues.
You mention that operational cost is all the same to you. In my experience, it's never that simple. There's setup time, configuration for 5000 different machines, network traffic, etc. And even if it's all automated, 5000 servers will have hardware issues, on average, at 1000x the rate of 5 servers, so if you expect 1% of the servers to fail per year, you'll have brokers failing almost weekly. Having large servers doesn't guarantee no hardware failures, but the likelihood is less.
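To put the failure math in concrete terms, here's a quick back-of-the-envelope sketch in Python (assuming the 1% annual failure rate per server used above; the numbers are purely illustrative):

    # Back-of-the-envelope failure-rate comparison.
    # Assumes the 1% annual failure rate per server mentioned above.
    annual_failure_rate = 0.01

    for servers in (5, 5000):
        failures_per_year = servers * annual_failure_rate
        days_between_failures = 365 / failures_per_year
        print(f"{servers} servers: ~{failures_per_year:.1f} failures/year, "
              f"one roughly every {days_between_failures:.0f} days")

With 5000 servers that works out to about 50 failures a year, i.e. roughly one per week, versus one every couple of decades with 5 servers.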

Related

ksqlDB recommendations for deploying large set of queries

I am running a ksqlDB streaming application that consists of a large number of queries (>60 queries), including many joins and aggregations. My data comes from various sources, and requires plenty of manipulation to produce the desired processed data, hence the large number of queries. I've run this set of queries on a single machine, using interactive mode, and it produces the right results. But I observe an increasing consumer lag when I increase the amount of data fed into the application.
I read on ksqlDB's Capacity Planning page that I can scale by adding more servers, which is what I plan to do.
Under Important Sizing Factors, it's also stated that "You should avoid running a large number of queries on one ksqlDB cluster. Instead, use interactive mode to play with your data and develop sets of queries that function together. Then, run these in their own headless cluster." However, I am unsure how to do this- my queries are all dependent on each other.
Does anyone have any general recommendations on how to deploy a large number of interdependent ksql queries? As an added requirement, the data is refreshed each day and is independent for each new day, so I need to do some sort of refresh of the queries each day.
I think that's just a recommendation: if you can group queries that depend on each other, then split those groups across headless-mode servers.
Another way, if you use interactive mode, is to partition your topics and add more ksqlDB servers to your cluster. This allows ksqlDB to split the workload across the cluster, with each server consuming and processing a subset of partitions. Say you have 4 partitions per topic and 2 servers: then one server processes 2 partitions and the other server the remaining 2. This should decrease the workload on each server.
Another improvement is to reduce the number of streams threads. Each query you create runs with 4 Kafka Streams threads by default. The more threads, the more parallel work is done on the server. With a large number of queries, performance decreases and lag increases. Try with 1 thread and see if that helps. Set ksql.streams.num.stream.threads=1 in ksql-server.properties to configure it.
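To see why the default matters, here's a quick back-of-the-envelope count (assuming the ">60 queries" from the question and the default of 4 threads per query mentioned above):

    # Rough count of Kafka Streams threads on a single ksqlDB server.
    # Assumes the ">60 queries" from the question and the default of
    # 4 threads per query (ksql.streams.num.stream.threads).
    queries = 60
    default_threads_per_query = 4
    reduced_threads_per_query = 1

    print("With defaults:", queries * default_threads_per_query, "stream threads")        # 240
    print("With num.stream.threads=1:", queries * reduced_threads_per_query, "stream threads")  # 60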

Hardware requirement for apache kafka

I am building a production environment where I will be running Apache Kafka. I want to know the best hardware combination to have for better performance. I will be handling 5000 transactions/second.
You would need to provide some more details regarding your use-case like average size of messages etc. but here's my 2 cents anyway:
Confluent's documentation might shed some light:
CPUs: Most Kafka deployments tend to be rather light on CPU requirements. As such, the exact processor setup matters less than the other resources. Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).
You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.
If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
How to compute your throughput
It might also be helpful to compute the throughput. For example, if you have 800 messages per second of 500 bytes each, then your throughput is 800 * 500 / (1024 * 1024) ≈ 0.4MB/s. Now if your topic is partitioned and you have 3 brokers up and running with 3 replicas, that would lead to 0.4 * 3 (replicas) / 3 (brokers) = 0.4MB/s per broker.
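If it helps, here's the same calculation as a small Python sketch (the message rate, message size, broker count, and replication factor are just the example numbers from above):

    # Per-broker write throughput estimate, using the example numbers above.
    messages_per_sec = 800
    message_size_bytes = 500
    brokers = 3
    replication_factor = 3

    total_mb_per_sec = messages_per_sec * message_size_bytes / (1024 * 1024)
    per_broker_mb_per_sec = total_mb_per_sec * replication_factor / brokers

    print(f"Total ingest:      ~{total_mb_per_sec:.2f} MB/s")
    print(f"Per broker (RF=3): ~{per_broker_mb_per_sec:.2f} MB/s")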
More details regarding your architecture can be found in Confluent's whitepaper Apache Kafka and Confluent Reference Architecture. Here's the section on memory usage:
ZooKeeper uses the JVM heap, and 4GB RAM is typically sufficient. Too small of a heap will result in high CPU due to constant garbage collection, while too large a heap may result in long garbage collection pauses and loss of connectivity within the ZooKeeper cluster.
Kafka brokers use both the JVM heap and the OS page cache. The JVM heap is used for replication of partitions between brokers and for log compaction. Replication requires 1MB (default replica.max.fetch.size) for each partition on the broker. In Apache Kafka 0.10.1 (Confluent Platform 3.1), we added a new configuration (replica.fetch.response.max.bytes) that limits the total RAM used for replication to 10MB, to avoid memory and garbage collection issues when the number of partitions on a broker is high. For log compaction, calculating the required memory is more complicated, and we recommend referring to the Kafka documentation if you are using this feature. For small to medium-sized deployments, 4GB heap size is usually sufficient. In addition, it is highly recommended that consumers always read from memory, i.e. from data that was written to Kafka and is still stored in the OS page cache. The amount of memory this requires depends on the rate at which this data is written and how far behind you expect consumers to get. If you write 20GB per hour per broker and you allow brokers to fall 3 hours behind in a normal scenario, you will want to reserve 60GB for the OS page cache. In cases where consumers are forced to read from disk, performance will drop significantly.
Kafka Connect itself does not use much memory, but some connectors buffer data internally for efficiency. If you run multiple connectors that use buffering, you will want to increase the JVM heap size to 1GB or higher.
Consumers use at least 2MB per consumer and up to 64MB in cases of large responses from brokers (typical for bursty traffic).
Producers will have a buffer of 64MB each. Start by allocating 1GB RAM and add 64MB for each producer and 16MB for each consumer planned.
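As a rough illustration of those rules of thumb, here's a small Python sketch; the producer and consumer counts are made-up example inputs, while the 20GB/hour and 3-hour figures come straight from the quote above:

    # Rough memory sizing based on the rules of thumb quoted above.
    producers = 10                      # example assumption
    consumers = 20                      # example assumption
    write_gb_per_hour_per_broker = 20   # figure from the quote
    consumer_lag_hours = 3              # figure from the quote

    client_heap_mb = 1024 + 64 * producers + 16 * consumers
    page_cache_gb_per_broker = write_gb_per_hour_per_broker * consumer_lag_hours

    print(f"Client-side JVM heap: ~{client_heap_mb} MB")                # ~1984 MB
    print(f"OS page cache per broker: ~{page_cache_gb_per_broker} GB")  # 60 GB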
There are many different factors that need to be taken into consideration when it comes to tuning the configuration of your architecture. I would suggest going through the aforementioned documentation, monitoring your existing cluster and resources, and then tuning them accordingly.

How much memory Kafka cluster needs?

How can I calculate how much memory and CPU my Kafka cluster needs?
My cluster consists of 3 nodes, with a throughput of ~800 messages per second.
Currently each node has 6 GB RAM, 2 CPUs, and a 1TB disk, and it doesn't seem to be enough. How much would you allocate?
I think you want to start by profiling your kafka cluster.
See the answer to this post: CPU Profiling kafka brokers.
It basically recommends that you use a Prometheus and Grafana stack to visualize your load on a timeline; from that you should be able to determine your bottleneck. It also links to an article that describes how.
You may also find the post interesting because the poster seems to have about the same workload as you.

Apache Geode scaling

I'm trying to measure the performance of Geode
I have 3 identical hosts to test it.
I created a partitioned region.
I started a geode cluster with one server.
I do "get" and "put" operations in the loop.
I get about 50000 op/sec.
Then I started a cluster with three Geode nodes.
I do get and put operations in the loop.
I get the same 50000 op/sec.
I would expect to see increased performance, but it is surprisingly the same for the 1-node cluster and the 3-node cluster.
Could you please help? What settings can I change in order to get horizontal scalability?
Thank you.
Well, you just got horizontal scalability for data storage at no loss of throughput :)
As for horizontally scaling your throughput: I think your workload was not enough to max out a single server. You need to start multiple clients (or threads in a single client) against a single server until adding new clients no longer increases throughput. At that point, start a new server. The new server should allow you to add more clients and horizontally scale your throughput.
You may find the YCSB benchmark useful; it allows you to start multiple threads in a client to perform operations.
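If you want to roll your own load generator instead of YCSB, the shape of the measurement looks roughly like the sketch below. This is a generic Python skeleton, not Geode code (Geode's clients are Java/native), and do_put_get is a hypothetical placeholder for whatever put/get call you are benchmarking:

    import threading
    import time

    def do_put_get(i):
        # Hypothetical placeholder: replace with a real put/get against your region.
        pass

    def worker(stop_event, counts, lock):
        local_ops = 0
        while not stop_event.is_set():
            do_put_get(local_ops)
            local_ops += 1
        with lock:
            counts[0] += local_ops

    def measure(num_threads, duration_sec=10):
        stop_event, lock, counts = threading.Event(), threading.Lock(), [0]
        threads = [threading.Thread(target=worker, args=(stop_event, counts, lock))
                   for _ in range(num_threads)]
        for t in threads:
            t.start()
        time.sleep(duration_sec)
        stop_event.set()
        for t in threads:
            t.join()
        return counts[0] / duration_sec

    # Keep adding client threads until ops/sec stops improving, then add a server.
    for n in (1, 2, 4, 8, 16):
        print(f"{n} threads -> {measure(n):.0f} ops/sec")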
You should set up an environment where you can see a performance decrease with a single node, and then run the same test with the partitioned setup.

Sharding vs DFS

As far as I understand, sharding (e.g. in MongoDB) and distributed file systems (e.g. HDFS in HBase or Hypertable) are different mechanisms that databases use to scale out, but I wonder how they compare?
Traditional sharding involves breaking tables into a small number of pieces and running each piece (or "shard") in a separate database on a separate machine. Because of the large shard size, this mechanism can be prone to imbalances due to hot spots and unequal growth, as was evidenced by the Foursquare incident. Also, because each shard is run on a separate machine, these systems can experience availability problems if one of the machines goes down. To mitigate this problem, most sharding systems, including MongoDB, implement replica groups. Each machine is replaced by a set of three machines in a master plus two slaves configuration. This way, if a machine goes down, there are two remaining replicas to serve the data.
There are a couple of problems with this design. First, if a replica fails in a replica group and the group is only left with two members, then to bring the replication count back to three, the data on one of these two machines needs to be cloned. Since there are only two machines in the entire cluster that can be used to re-create the replica, there will be enormous drag on one of those two machines while re-replication is taking place, causing serious performance problems on the shard in question (it takes over two hours to copy 1TB over a gigabit link). The second problem is that when one of the replicas goes down, it needs to be replaced with a new machine. Even if there is plenty of spare capacity across the cluster to resolve the replication problem, that spare capacity cannot be used to rectify the situation. The only way to solve it is to replace the machine. This becomes very challenging from an operational standpoint as cluster sizes grow into the hundreds or thousands of machines.
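For reference, the two-hour figure in the parenthetical is simple arithmetic; here is a quick sketch assuming an ideal, fully utilized 1 Gbit/s link with no protocol overhead:

    # Time to copy 1 TB over an ideal 1 Gbit/s link (no protocol overhead).
    bits_to_copy = 8 * 10**12        # 1 TB = 8e12 bits (decimal terabyte)
    link_bits_per_sec = 10**9        # 1 gigabit per second
    hours = bits_to_copy / link_bits_per_sec / 3600
    print(f"~{hours:.1f} hours")     # about 2.2 hours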
The Bigtable+GFS design solves these problems. First, the table data is broken down into much finer grained "tablets". A typical machine in a Bigtable cluster will often have 500+ tablets. If an imbalance occurs, resolving it is just a simple matter of migrating a small number of tablets from one machine to another. If a TabletServer goes down, because the data set is broken down and replicated with such fine granularity, there can be hundreds of machines that participate in the recovery process, which distributes the recovery burden and speeds recovery time. Also, because the data is not tied to a specific machine or machines, the spare capacity on all machines in the cluster can be applied to the failure. There is no operational requirement to replace the machine since any of the spare capacity throughout the cluster can be used to rectify replication imbalance.
Doug Judd
CEO, Hypertable Inc.