We are thinking about using the Strimzi Kafka Bridge (https://strimzi.io/docs/bridge/latest/#proc-creating-kafka-bridge-consumer-bridge) as an HTTP(S) gateway to an existing Kafka cluster.
The documentation describes creating consumers with arbitrary names that take part in a consumer group. These names can subsequently be used to consume messages, seek or commit offsets, and so on.
The question is: Am I right in assuming the following?
The bridge consumers seem to be created and maintained in just one Kafka Bridge instance.
If I want to use more than one bridge to meet fault-tolerance requirements, the information about a specific consumer will not be available on the other nodes, since there is no synchronization or shared storage between the bridge nodes.
So if the clients of the Kafka Bridge are not sticky, then as soon as a client talks to another node (e.g. because of round-robin handling by a load balancer), the consumer information will not be available there, and the HTTP(S) clients must be prepared to recreate their consumers on the node they are now communicating with.
The offsets will be lost. In the worst case, fetching messages and committing their offsets will always happen on different nodes.
Or did I overlook anything?
You are right. The state and the Kafka connections are currently not shared in any way between the bridge instances. The general recommendation is that when using consumers, you should run the bridge with only a single replica (and, if needed, deploy different bridge instances for different consumer groups).
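To make the flow concrete, here is a minimal sketch of the HTTP calls the question refers to, using Java's built-in HttpClient. The bridge address, group, consumer name and topic are made up, and the endpoints and content types should be double-checked against the bridge documentation linked above; the key point is that the consumer created in the first call lives only in the bridge instance that handled it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BridgeConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical bridge address and names; adjust for your deployment.
        String bridge = "http://my-bridge:8080";
        String group = "my-group";
        String name = "my-consumer";

        HttpClient http = HttpClient.newHttpClient();

        // 1) Create a named consumer inside the consumer group. The state for this
        //    consumer now lives only in the bridge instance that served the request.
        HttpRequest create = HttpRequest.newBuilder(URI.create(bridge + "/consumers/" + group))
                .header("Content-Type", "application/vnd.kafka.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"name\":\"" + name + "\",\"format\":\"json\",\"auto.offset.reset\":\"earliest\"}"))
                .build();
        System.out.println(http.send(create, HttpResponse.BodyHandlers.ofString()).body());

        // 2) Subscribe that consumer to a topic.
        HttpRequest subscribe = HttpRequest.newBuilder(
                URI.create(bridge + "/consumers/" + group + "/instances/" + name + "/subscription"))
                .header("Content-Type", "application/vnd.kafka.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"topics\":[\"my-topic\"]}"))
                .build();
        http.send(subscribe, HttpResponse.BodyHandlers.ofString());

        // 3) Poll records. If a load balancer sends this request to a different bridge
        //    replica, that replica does not know the consumer and the request fails.
        HttpRequest poll = HttpRequest.newBuilder(
                URI.create(bridge + "/consumers/" + group + "/instances/" + name + "/records"))
                .header("Accept", "application/vnd.kafka.json.v2+json")
                .GET()
                .build();
        System.out.println(http.send(poll, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```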
Related
Let's say I have a cheap and less reliable datacenter A, and an expensive and more reliable datacenter B. I want to run Kafka in the most cost-effective way, even if that means risking data loss and/or downtime. I can run any number of brokers in either datacenter, but remember that costs need to be as low as possible.
For this scenario, assume that no costs are incurred if brokers are not running. Also assume that producers/consumers run completely reliably with no concern for their cost.
Two ideas I have are as follows:
Provision two completely separate Kafka clusters, one in each datacenter, but keep the cluster in the more expensive datacenter (B) powered off. Upon detecting an outage in A, power on the cluster in B. Producers/consumers will have logic to switch between clusters.
Run the Zookeeper cluster in B, with powered on brokers in A, and powered off brokers in B. If there is an outage in A, then brokers in B come online to pick up where A left off.
Option 1 would be cheaper, but requires more complexity in the producers/consumers. Option 2 would be more expensive, but requires less complexity in the producers/consumers.
Is Option 2 even possible? If there is an outage in A, is there any way to have brokers in B come online, get elected as leaders for the topics and have the producers/consumers seamlessly start sending to them? Again, data loss is okay and so is switchover downtime. But whatever option needs to not require manual intervention.
Is there any other approach that I can consider?
Neither is feasible.
Topics and their records are unique to each cluster, and only one leader replica can exist for any Kafka partition within a cluster.
With these two pieces of information, example scenarios include:
Producers cut over to the new cluster and find the new leaders until the old cluster comes back.
Even if the above could happen instantaneously, or with minimal retries, which cluster are the consumers then responsible for reading from? They cannot aggregate data from more than one bootstrap.servers list at any time (see the sketch after these scenarios).
So you end up in a situation where both clusters always need to be available, with N consumer threads for the N partitions that now exist in the other cluster and M threads for the original cluster.
Meanwhile, producers are back to writing to the appropriate (cheaper) cluster, so data will potentially be processed out of order, since you have no control over which consumer threads process which data first.
Only after you track the consumer lag of the threads pointed at the more expensive cluster can you reasonably stop them and shut that cluster down, once lag reaches zero across all consumers.
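To illustrate the consumer side of that: a client cannot poll two clusters through one connection, so reading across a failover ends up as two separate consumers, one per bootstrap.servers list. A rough sketch under made-up addresses and topic names (not a production pattern):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoClusterConsumersSketch {
    // A consumer is bound to exactly one bootstrap.servers list, so reading the "same"
    // topic from both datacenters means running one consumer per cluster.
    private static KafkaConsumer<String, String> consumerFor(String bootstrap) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders"));
        return consumer;
    }

    public static void main(String[] args) {
        // Hypothetical addresses for the cheap (A) and expensive (B) clusters.
        try (KafkaConsumer<String, String> fromA = consumerFor("dc-a-broker:9092");
             KafkaConsumer<String, String> fromB = consumerFor("dc-b-broker:9092")) {
            while (true) {
                // Two independent poll loops (in practice: separate threads), and no
                // ordering guarantee between records coming from the two clusters.
                fromA.poll(Duration.ofMillis(200)).forEach(r -> System.out.println("A: " + r.value()));
                fromB.poll(Duration.ofMillis(200)).forEach(r -> System.out.println("B: " + r.value()));
            }
        }
    }
}
```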
Another thing to keep in mind is that topic creation/update/delete events aren't automatically synced across clusters, so Kafka Streams apps, especially, will all be unable to maintain state with this approach.
You can use tools like MirrorMaker or Confluent Replicator / Cluster Linking to help with all of this, but I've personally never seen the client-failover piece handled very well, especially when record order and idempotent processing matter.
Ultimately, this is what availability zones are for. From what I understand, the chances of a cloud provider losing more than one availability zone at a time are extremely rare. So you'd set up one Kafka cluster across three or more availability zones and configure "rack awareness" so Kafka accounts for where each broker is installed.
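As a rough illustration of the rack-awareness approach (broker addresses, zone names and the topic are made up): each broker declares its zone via broker.rack in its server.properties, and a topic created with replication factor 3 then gets one replica per zone.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RackAwareTopicSketch {
    public static void main(String[] args) throws Exception {
        // Assumes each broker's server.properties sets broker.rack to its availability
        // zone (e.g. broker.rack=az-1 / az-2 / az-3); with that in place Kafka spreads
        // the replicas of a new topic across the zones automatically.
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: ideally one replica per availability zone.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```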
If you want to keep the target/passive cluster shut down while it is not needed and then spin it up on demand, you should be OK, as long as you don't need any history and don't care about the consumer-lag gap building up in the source cluster. Obviously this is use-case dependent.
MirrorMaker 2, or any sort of asynchronous directional replication, requires the target cluster to be active all the time.
A stretch cluster is not really doable because of the two-datacenter constraint; whether you use KRaft or ZooKeeper, you need a third datacenter for the quorum, and that would probably be your most expensive option.
Redpanda has the capability of offloading all of your log segments to S3 and indexing them so they can be used by other clusters. So if you constantly wrote one copy of your log segments to a storage array with an S3 interface in your standby DC, it might be palatable: whenever needed, you just spin up a cluster on demand in the target DC, point it at the object store, and you can immediately start producing and consuming with your new clients.
Is there a way to automatically tell Kafka to send all events of a specific topic to a specific table of a database?
The goal is to avoid creating a new consumer that needs to read from that topic and perform the copy explicitly.
You have two options here:
Kafka Connect - this is the standard way to connect Kafka to a database. There are a lot of connectors. In order to choose one:
The best bet is to use the specific connector for your database, maintained by Confluent.
If there isn't a specific one, the second-best option is to use the JDBC connector (see the sketch at the end of this answer).
Direct ingestion by the database, if your database supports it (for instance, ClickHouse and MemSQL are able to load data coming from a Kafka topic). The difference between this and Kafka Connect is that this way it is fully supported and tested by the DB vendor, and you have fewer pieces of infrastructure to maintain.
Which one is better? It depends on:
your data volume
how much you can (and need to!) parallelize the load
and how much downtime or latency you can tolerate.
Direct ingestion by the database usually runs from a single node (consumer) connected to Kafka.
It is good for low-to-mid-volume data traffic. If that node fails (or throttles), you might see latency issues.
Kafka Connect allows you to insert data into the DB in parallel using several workers. If one of the workers fails, the load is redistributed among the others. If you have a lot of data, this is probably the best way to load it into the DB, but you'll need to take care of the Kafka Connect infrastructure yourself unless you're using a managed cloud offering.
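As a sketch of the Kafka Connect route, this is roughly what registering a JDBC sink connector against the Connect REST API could look like. The worker address, connector name, database URL and credentials are made up, and the property names follow the Confluent JDBC sink connector docs, so double-check them for the connector you actually pick:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSinkSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Connect worker address and connector name.
        String connectUrl = "http://connect:8083/connectors/orders-jdbc-sink/config";

        // tasks.max controls the parallelism mentioned above: Connect spreads the
        // topic's partitions across up to this many tasks on the available workers.
        String config = """
                {
                  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
                  "topics": "orders",
                  "connection.url": "jdbc:postgresql://db:5432/shop",
                  "connection.user": "kafka",
                  "connection.password": "secret",
                  "insert.mode": "upsert",
                  "pk.mode": "record_key",
                  "auto.create": "true",
                  "tasks.max": "4"
                }
                """;

        // PUT /connectors/{name}/config creates the connector or updates its config.
        HttpRequest request = HttpRequest.newBuilder(URI.create(connectUrl))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```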
On a Kafka broker, it's recommended to use multiple drives for the message logs to improve throughput. That's why there is a log.dirs property that can list multiple directories, which are assigned to partitions in a round-robin fashion.
We have a lot of installations already set up this way for event-driven Kafka applications, with e.g. 4 nodes with 5 disks each.
Now we want to use Kafka Streams with a key-value store in which we persist computed data for fast range queries. We see that Kafka Streams maps partitions one-to-one to multiple state stores and creates a separate subdirectory for each one.
However, we can't configure how those subdirectories are spread across different disks. We can only configure a single parent directory via 'state.dir' (StreamsConfig.STATE_DIR_CONFIG).
Is there a configuration I am missing? Or is having multiple disks not so relevant for Kafka Streams?
It's not really relevant at the Kafka Streams level; this must be handled at the OS level, via a RAID configuration for example.
Or you can implement the StateStore interface and write your own provider that can use multiple disks (or a remote distributed filesystem).
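For completeness, a minimal sketch of the single state.dir setting in question, pointed at a striped (RAID/LVM) mount as suggested above; the application id, broker address, topic and store names are made up:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;

public class StateDirSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "range-query-app");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // hypothetical
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Single parent directory for all state store subdirectories. To spread the
        // I/O over several physical disks, point this at a RAID/LVM volume that
        // stripes across them (the OS-level approach from the answer above).
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/raid0/kafka-streams-state");

        StreamsBuilder builder = new StreamsBuilder();
        // One state store subdirectory is created per partition/task under state.dir.
        builder.table("computed-data", Materialized.as("computed-data-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```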
I have two applications: one is a regular Kafka consumer and the other is a gRPC-based microservice. The Kafka consumer is only responsible for consuming messages, and the business logic resides within the microservice. Also, the key for messages within our Kafka topic is null, so Kafka does round-robin assignment of messages to the partitions, which distributes the incoming messages evenly across all partitions. At the end of the day I am dealing with non-transactional storage (BigTable), so I have to make sure that only one thread is responsible for reading, updating and writing a given row key in the storage, in order to avoid race conditions. My gRPC microservice is running within a Kubernetes cluster on multiple pods. How can I make sure that a message belonging to a particular row key goes to the same pod within the Kubernetes cluster, so that there are no race conditions? My microservice is responsible for writing the final output to BigTable, and the microservice is sitting behind a load balancer.
It might not be a solution if you already have a (big) code base, but stream-processing frameworks like Apache Flink handle this pretty gracefully.
Flink has a keyBy() operator that does exactly what you want: it partitions the messages by a key you define and guarantees that messages with the same key get processed by the same thread.
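A minimal sketch of the keyBy() idea, using made-up string events of the form "<rowKey>,<payload>"; in a real job the source would be Flink's Kafka connector rather than fromElements:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyBySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Made-up events in the form "<rowKey>,<payload>".
        DataStream<String> events = env.fromElements(
                "row-1,increment", "row-2,increment", "row-1,reset");

        events
                // All events with the same row key are routed to the same parallel
                // subtask, so updates for one storage row are handled by one thread.
                .keyBy(value -> value.split(",", 2)[0])
                .map(value -> "processed: " + value)
                .print();

        env.execute("keyed-processing-sketch");
    }
}
```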
I have 4 machines where a Kafka cluster is configured with a topology in which
each machine has one ZooKeeper node and two brokers.
With this configuration, what do you advise as the maximum number of topics and partitions for best performance?
Replication factor: 3
Using Kafka 0.10.x
Thanks!
Each topic is restricted to 100,000 partitions, no matter how many nodes there are (as of July 2017).
As for the number of topics, that depends on the size of the smallest RAM across the machines. This is because ZooKeeper keeps everything in memory for quick access (it also doesn't shard the znodes; it just replicates them across ZK nodes upon write). This effectively means that once you exhaust one machine's memory, ZooKeeper will fail to add more topics. You will most likely run out of file handles on the Kafka broker nodes before reaching this limit, though.
To quote the Kafka docs (6.1 Basic Kafka Operations, https://kafka.apache.org/documentation/#basic_ops_add_topic):
Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
To quote the Zookeeper docs (https://zookeeper.apache.org/doc/trunk/zookeeperOver.html):
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Performance:
Depending on your publishing and consumption semantics, the appropriate number of topics and partitions will change. The following is a set of questions you should ask yourself to gain insight into a potential solution (your question is very open-ended):
Is the data I am publishing mission-critical (i.e. I cannot lose it, must be sure I published it, must have exactly-once consumption)?
Should I make the producer.send() call as synchronous as possible, or continue to use the asynchronous method with batching (do I trade off publishing guarantees for speed)? See the sketch after this list.
Are the messages I am publishing dependent on one another? Does message A have to be consumed before message B (implies A published before B)?
How do I choose which partition to send my message to?
Should I assign the message to a partition explicitly (extra producer logic), let the cluster decide in a round-robin fashion, or assign a key that hashes to one of the topic's partitions (you need an evenly distributed hash to get good load balancing across partitions)?
How many topics should you have? How is this connected to the semantics of your data? Will auto-creating topics for many distinct logical data domains be efficient (think of the effect on ZooKeeper and the administrative pain of deleting stale topics)?
Partitions provide parallelism (more consumers possible) and possibly better load balancing (if the producer publishes correctly). Would you want to assign parts of your problem domain to specific partitions (e.g. when publishing, send data for client A to partition 1)? What side effects does this have (think refactorability and maintainability)?
Will you want to create more partitions than you currently need so you can scale up later with more brokers/consumers? How realistic is automatic scaling of a Kafka cluster given your expertise? Will this be done manually? Is manual scaling viable for your problem domain (are you building Kafka around a fixed system with well-known characteristics, or do you need to handle severe spikes in messages)?
How will my consumers subscribe to topics? Will they use pre-configured topic lists or a regex to consume many topics? Are messages between topics dependent or prioritized (priority needs extra logic on the consumer)?
Should you use different network interfaces for replication between brokers (e.g. port 9092 for producers/consumers and 9093 for replication traffic)?
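As promised above, a small sketch touching the producer.send() and partitioning questions: an asynchronous send with a callback, a blocking ("synchronous") send via the returned future, and a keyed record versus an explicitly partitioned one. The broker address, topic, keys and values are made up:

```java
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSemanticsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker address and topic.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // stronger publishing guarantee

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Asynchronous send with a callback: batching-friendly, errors reported later.
            producer.send(new ProducerRecord<>("events", "client-A", "payload-1"),
                    (RecordMetadata metadata, Exception exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // decide: retry, log, or drop
                        }
                    });

            // "Synchronous" send: block on the future, trading throughput for certainty.
            Future<RecordMetadata> future =
                    producer.send(new ProducerRecord<>("events", "client-A", "payload-2"));
            RecordMetadata metadata = future.get(); // throws if the publish failed

            // Partition choice: a key hashes to a fixed partition (ordering per key),
            // or an explicit partition number can be given (extra producer logic).
            producer.send(new ProducerRecord<>("events", 1, "client-A", "payload-3"));

            System.out.println("last acked offset: " + metadata.offset());
        }
    }
}
```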
Good Links:
http://cloudurable.com/ppt/4-kafka-detailed-architecture.pdf
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://kafka.apache.org/documentation/