Running Kafka cost-effectively at the expense of lower resilience - apache-kafka

Let's say I have a cheap and less reliable datacenter A, and an expensive and more reliable datacenter B. I want to run Kafka in the most cost-effective way, even if that means risking data loss and/or downtime. I can run any number of brokers in either datacenter, but remember that costs need to be as low as possible.
For this scenario, assume that no costs are incurred if brokers are not running. Also assume that producers/consumers run completely reliably with no concern for their cost.
Two ideas I have are as follows:
Provision two completely separate Kafka clusters, one in each datacenter, but keep the cluster in the more expensive datacenter (B) powered off. Upon detecting an outage in A, power on the cluster in B. Producers/consumers will have logic to switch between clusters.
Run the Zookeeper cluster in B, with powered on brokers in A, and powered off brokers in B. If there is an outage in A, then brokers in B come online to pick up where A left off.
Option 1 would be cheaper, but requires more complexity in the producers/consumers. Option 2 would be more expensive, but requires less complexity in the producers/consumers.
Is Option 2 even possible? If there is an outage in A, is there any way to have brokers in B come online, get elected as leaders for the topics and have the producers/consumers seamlessly start sending to them? Again, data loss is okay and so is switchover downtime. But whatever option needs to not require manual intervention.
Is there any other approach that I can consider?

Neither is feasible.
Topics and their records are unique to each cluster. Only one leader partition can exist for any Kafka partition in a cluster.
With these two pieces of information, example scenarios include:
Producers cut over to a new cluster, and find the new leaders until old cluster comes back
Even if above could happen instantaneously, or with minimal retries, consumers then are responsible for reading from where? They cannot aggregate data from more than one bootstrap.servers at any time.
So, now you get into a situation where both clusters always need to be available, with N consumer threads for N partitions existing in the other cluster, and M threads for the original cluster
Meanwhile, producers are back to writing to the appropriate (cheaper) cluster, so data will potentially be out of order since you have no control which consumer threads process what data first.
Only after you track the consumer lag from the more expensive cluster consumers will you be able to reasonably stop those threads and shut down that cluster upon reaching zero lag across all consumers
Another thing to keep in mind is that topic creation/update/delete events aren't automatically synced across clusters, so Kafka Streams apps, especially, will all be unable to maintain state with this approach.
You can use tools like MirrorMaker or Confluent Replicator / Cluster Linking to help with all this, but the client failover piece I've personally never seen handled very well, especially when record order and idempotent processing matters
Ultimately, this is what availability zones are for. From what I understand, the chances of a cloud provider losing more than one availability zone at a time is extremely rare. So, you'd setup one Kafka cluster across 3 or more availability zones, and configure "rack awareness" for Kafka to account for its installation locations.

If you want to keep the target / passive cluster shutdown while not operational and then spin up the cluster you should be ok if you don't need any history and don't care about the consumer lag gap in the source cluster.. obv use case dependent.
MM2 or any sort of async directional replication requires the cluster to be active all the time.
Stretch cluster is not really doable b/c of the 2 dc thing, whether raft or zk you need a 3rd dc for that, and that would probably be your most expensive option.
Redpanda has the capability of offloading all of your log segments to s3 and then indexes them to allow them to be used for other clusters, so if you constantly wrote one copy of your log segments to your standby DC storage array with s3 interface it might be palatable. Then whenever needed you just spin up a cluster on demand in the target dc and point it to the object store and you can immediately start producing and consuming with your new clients.

Related

Understanding Kafka Horizontally Scaling: When do I need a new Broker?

When we want to horizontally scale a kafka cluster, we can do so by adding more kafka brokers. I I have been trying to understand, how does one decide "now is the time to add a broker to a kafka cluster"?
For sake of simplicity assuming that failover is not a requirement.
Are there:
any best practices in choosing the number of kafka brokers?
metrics that should be monitored? (number of partitions / consumers / producers / bytes trasnferred?)
any resources, we could use to monitor the health of the kafka brokers?
Apart from host machine storage, CPU, Memory and Networking, since Kafka brokers runs on JVM, all the important stats like heap memory, threads are important to
monitor and should be considered when scaling needs arises. From perspective of messaging functionality since topics are implemented as partitions
or commit ahead logs which are appended serially it's important to monitor the disk writes as well. Similarly since Kafka stores the messages persistently
for a predefined period - the storage size for each partition and overall size is also important.
You can use performance scripts provided by Kafka (here and here) and experiment with different setting like here to try out different configurations, monitor your cluster during the performance run and then decide right scaling for your use case.

Open source multi region, consistent at-least-once FIFO solution: Dedicated Queue (e.g. Kafka) vs Database (e.g. Cassandra, RethinkDB)?

I've been searching for a FIFO solution where producers and consumers can be deployed in multiple data-centers, in different regions (e.g. >20ms ping). Obviously paying the price of increased latency, the main goal is to handle transparently the increased latency, spikes in latency, link failures.
This theoretical use-case is like this:
Super Fast Producer --sticky-load-balancing-with-fail-over--> Multi-Region Processors -->
Queue(FIFO based on order established by the producer) --> Multi-Region Consumers with fail-over
Consumers should not consume from the same "queue" at the same time, however, let's not consider the scaling aspect here. If the replication and fail-over work well for one "queue" the partitioning can be applied even at the application level with a decent amount of effort.
Thoughts:
In order for fail-over to work correctly, the Queue (e.g. messages, consumer offsets) must be active-active synchronously replicated between data centers. I don't see how an active-standby asynchronous topology can work without losing messages or break FIFO in failure scenarios.
Kafka stretch cluster would be perfect, although it can span multiple availability zones (<2ms ping and stable connections), most people advise against multiple regions (>15ms ping, unstable connections).
Confluent Platform 5.4 with the synchronous replication feature is in Preview, we could fail-over consumers at the application level in case the local cluster is down. Since data is replicated synchronously we should not break FIFO or lose messages during fail-over. In order to ensure a more active-active setup, we could rotate the Consumers periodically between data centers (e.g. once or twice a day in off-peak hours).
A DB (like Cassandra) can handle consistency across multiple data-center/regions. However, a queue use-case is an anti-pattern (Using Cassandra as a Queue).
The first point would be about the pure insert/delete workload which will make the DB work really hard to remove tombstones. It is sub-optimal use of the DB, but if it can handle the workload reliably than it is not a problem IMHO
The second point is about polling, consumers will generate a large amount of quorum reads just for polling the DB even if there is no data. Again IMHO Cassandra will handle this reliably even if it is a poor use of its capabilities.
Using a DB with notifications, like CouchDB/RethinkDB. CouchDB's replication is asynchronous so I do not see how Consumers can have a consistent view of the queue. For RethinkDB I am not sure how reliable it works across regions with majority reads and writes.
Have you deployed such "queues" in production, which would you choose?
Kafka supports 2 patterns Publish-Subscribe and Message Queue. There are some places discussed the differences. here
The problem you stated can be solved using Kafka. The FIFO queue can be implemented using the topic/partition/key message. All messages with the same key will belong to the same partition hence we can achieve the FIFO attribute. In case you want to increase the consuming throughput, you just need to increase the total of partitions per topic and increase number of consumers.
Other queues such as RabbitMQ are not easy, though. For load balancing the workload, we must use the separate queue which increasing the management cost.
You can implement many kinds of delivery semantics such as at-most-once, at-least-once, exactly-once (literally) at the producer side and the consumer side. Kafka also supports multi-center deployments.
Cassandra is not designed for queue modeling, and as you said using Cassandra as a queue is an anti-pattern. It can turn quick into a nightmare.
The main problem with the queue is the deletes (Cassandra doesn't perform well for frequently updated data anyway).
Here is a link that might help you understanding delete/queue.
https://lostechies.com/ryansvihla/2014/10/20/domain-modeling-around-deletes-or-using-cassandra-as-a-queue-even-when-you-know-better/

Max size production kafka cluster deployment for now

I am considering how to deploy our kafka cluster: a big cluster with several broker groups or several clusters. If a big cluster, I want to know how big a kafka cluster can be. kafka has a controller node and I don't know how many brokers it can support. And another one is _consume_offset_ topic ,how big it can be and can we add more partitions to it.
I've personally worked with production Kafka clusters anywhere from 3 brokers to 20 brokers. They've all worked fine, it just depends on what kind of workload you're throwing at it. With Kafka, my general recommendation is that it's better to have a smaller amount of larger/more-powerful brokers, than having a bunch of tiny servers.
For a standing cluster, each broker you add increases "crosstalk" between the nodes, since they have to move partitions around, replicate data, as well as maintain the metadata in sync. This additional network chatter can impact how much load the broker can handle. As a general rule, adding brokers will add overall capacity, but you have to shift partitions around so that the load will be balanced properly across the entire cluster. Because of that, it's much better to start with 10 nodes, so that topics and partitions will be spread out evenly from the beginning, than starting out with 6 nodes and then adding 4 nodes later.
Regardless of the size of the cluster, there is always only one controller node at a time. If that node happens to go down, another node will take over as controller, but only one can be active at a given time, assuming the cluster is not in an unstable state.
The __consumer_offsets topic can have as many partitions as you want, but it comes by default set to 50 partitions. Since this is a compacted topic, assuming that there is no excessive committing happening (this has happened to me twice already in production environments), then the default settings should be enough for almost any scenario. You can look up the configuration settings for consumer offsets topics by looking for broker properties that start with offsets. in the official Kafka documentation.
You can get more details at the official Kafka docs page: https://kafka.apache.org/documentation/
The size of a cluster can be determined by the following ways.
The most accurate way to model your use case is to simulate the load you expect on your own hardware.You can use the kafka load generation tools kafka-producer-perf-test and kafka-consumer-perf-test.
Based on the producer and consumer metrics, we can decide the number of brokers for our cluster.
The other approach is without simulation, which is based on the estimated rate at which you get data that required data retention period.
We can also calculate the throughput and based on that we can also decide the number of brokers in our cluster.
Example
If you have 800 messages per second, of 500 bytes each then your throughput is 800*500/(1024*1024) = ~0.4MB/s. Now if your topic is partitioned and you have 3 brokers up and running with 3 replicas that would lead to 0.4/3*3=0.4MB/s.
More details about the architecture are available at confluent.
Within a Kafka Cluster, a single broker works as a controller. If you have a cluster of 100 brokers then one of them will act as the controller.
If we talk internally, each broker tries to create a node(ephemeral node) in the zookeeper(/controller). The first one becomes the controller. The other brokers get an exception("node already exists"), they set a watch on the controller. When the controller dies, the ephemeral node is removed and watching brokers are notified for the controller selection process.
The functionality of the controller can be found here.
The __consumer_offset topic is used to store the offsets committed by consumers. Its default value is 50 but it can be set for more partitions. To change, set the offsets.topic.num.partitions property.

Apache Kafka isolate different producers

I'm working on a project where different producers (each one represented by another customer) can send events to my service.
This service is responsible for receiving those events and storing them in intermediate Kafka topic, later we are fetching and processing those events.
The problem is that one customer can flood events and effect processing of events of another customers, i'm trying to find a best way to create a level of isolation between different customers!
So far, i was able to solve this, by creating different topic for each customer.
Although this solution temporary solved the issue, it seems that Kafka is not designed to handle well huge number of topics 100k+ as our producers (customers) number grew up we started to experience that controlled restart of a single broker takes up to a few hours.
Can anyone suggest a better way to create level of isolation between producers?
You can take a look at Kafka limits, that is done on Kafka broker level. By configuring producers to have different user / client-id each, you could achieve some level of limiting (so that one producer does not flood others).
See https://kafka.apache.org/documentation.html#design_quotas
With the number (100k+) that you mentioned I think that you will probably need to solve this issue in your service that sits before Kafka.
Kafka can most probably (without knowing exact numbers) handle the load that you throw at it, but there is a limit to the number of partitions per broker that can be handled in a performant way. As usual there are no fixed limits for this, but I'd say the number of partitions per broker is more in the lower 4-figures, so unless you have a fairly large cluster you probably have many more than that. This can lead to longer restart times as all these partitions have to be recovered. What you could try is to experiment with the num.recovery.threads.per.data.dir parameter and set this higher, which could bring your restart times down.
I'd recommend consolidating topics to get the number down though and implementing some sort of flow control in the service that your customers talk to, maybe add a load balancer to be able to scale that service ..

Kafka Topology Best Practice

I have 4 machines where a Kafka Cluster is configured with topology that
each machine has one zookeeper and two broker.
With this configuration what do you advice for maximum topic&partition for best performance?
Replication Factor 3:
using kafka 0.10.XX
Thanks?
Each topic is restricted to 100,000 partitions no matter how many nodes (as of July 2017)
As to the number of topics that depends on how large the smallest RAM is across the machines. This is due to Zookeeper keeping everything in memory for quick access (also it doesnt shard the znodes, just replicates across ZK nodes upon write). This effectively means once you exhaust one machines memory that ZK will fail to add more topics. You will most likely run out of file handles before reaching this limit on the Kafka broker nodes.
To quote the KAFKA docs on their site (6.1 Basic Kafka Operations https://kafka.apache.org/documentation/#basic_ops_add_topic):
Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
To quote the Zookeeper docs (https://zookeeper.apache.org/doc/trunk/zookeeperOver.html):
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Performance:
Depending on your publishing and consumption semantics the topic-partition finity will change. The following are a set of questions you should ask yourself to gain insight into a potential solution (your question is very open ended):
Is the data I am publishing mission critical (i.e. cannot lose it, must be sure I published it, must have exactly once consumption)?
Should I make the producer.send() call as synchronous as possible or continue to use the asynchronous method with batching (do I trade-off publishing guarantees for speed)?
Are the messages I am publishing dependent on one another? Does message A have to be consumed before message B (implies A published before B)?
How do I choose which partition to send my message to?
Should I: assign the message to a partition (extra producer logic), let the cluster decide in a round robin fashion, or assign a key which will hash to one of the partitions for the topic (need to come up with an evenly distributed hash to get good load balancing across partitions)
How many topics should you have? How is this connected to the semantics of your data? Will auto-creating topics for many distinct logical data domains be efficient (think of the effect on Zookeeper and administrative pain to delete stale topics)?
Partitions provide parallelism (more consumers possible) and possibly increased positive load balancing effects (if producer publishes correctly). Would you want to assign parts of your problem domain elements to specific partitions (when publishing send data for client A to partition 1)? What side-effects does this have (think refactorability and maintainability)?
Will you want to make more partitions than you need so you can scale up if needed with more brokers/consumers? How realistic is automatic scaling of a KAFKA cluster given your expertise? Will this be done manually? Is manual scaling viable for your problem domain (are you building KAFKA around a fixed system with well known characteristics or are you required to be able to handle severe spikes in messages)?
How will my consumers subscribe to topics? Will they use pre-configured configurations or use a regex to consume many topics? Are the messages between topics dependent or prioritized (need extra logic on consumer to implement priority)?
Should you use different network interfaces for replication between brokers (i.e. port 9092 for producers/consumers and 9093 for replication traffic)?
Good Links:
http://cloudurable.com/ppt/4-kafka-detailed-architecture.pdf
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://kafka.apache.org/documentation/