I have two Kafka clusters, say A and B, where B is a replica of A. I would like to consume messages from cluster B only if A is down, and vice versa. However, consuming messages from both clusters would result in duplicate messages. So is there any way I can configure my Kafka consumer to receive messages from only one cluster?
So is there any way I can configure my Kafka consumer to receive messages from only one cluster?
Yes: a Kafka consumer instance will always receive messages from one Kafka cluster only. That is, there's no built-in option to use the same consumer instance for reading from 2+ clusters. But I think you are looking for something different; see below.
I would like to consume messages from cluster B only if A is down, and vice versa. However, consuming messages from both clusters would result in duplicate messages.
There's no built-in failover support such as "switch to cluster B if cluster A fails" in Kafka's consumer API. If you need such behavior (as in your case), you would need to implement it in the application that uses the Kafka consumer API.
For example, you could create a consumer instance to read from cluster A, monitor that instance and/or that cluster to determine whether failover to cluster B is required, and, if A does fail, perform the failover by creating another consumer instance to read from B.
There are a few gotchas, however, that make this failover behavior more complex than my simplified example suggests. One difficulty is knowing which messages from cluster A have already been read when switching over to B: this is tricky because message offsets typically differ between clusters, so determining whether the "copy" of a message (in B) was already read (from A) is not trivial.
Note: Sometimes you can simplify such failover logic, e.g. when message processing is idempotent (i.e. where duplicate messages / duplicate processing of messages will not alter the processing outcome).
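To make the idea concrete, here is a minimal sketch in Java of the application-level failover described above. The bootstrap addresses (cluster-a:9092, cluster-b:9092), the topic name my-topic, and the group id are assumptions, and the failure detection is deliberately naive (any exception triggers a switch):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FailoverConsumer {

    // Hypothetical bootstrap addresses for clusters A and B.
    private static final String CLUSTER_A = "cluster-a:9092";
    private static final String CLUSTER_B = "cluster-b:9092";

    public static void main(String[] args) {
        String active = CLUSTER_A;
        while (true) {
            try (KafkaConsumer<String, String> consumer = createConsumer(active)) {
                consumer.subscribe(List.of("my-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // application-specific; ideally idempotent
                    }
                }
            } catch (Exception e) {
                // Very naive failure detection: any error switches to the other cluster.
                active = active.equals(CLUSTER_A) ? CLUSTER_B : CLUSTER_A;
            }
        }
    }

    private static KafkaConsumer<String, String> createConsumer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "failover-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```

A real implementation would need smarter health checks and some strategy for deciding where to resume on the other cluster, per the offset caveat above.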
Related
I have built a microservice platform based on Kubernetes, and Kafka is used as the MQ within the services. Now a confusing question has arisen: Kubernetes is designed to make scaling microservices easy, but when the number of instances exceeds the number of Kafka partitions, some instances cannot consume any messages. What should I do?
This is a Kafka limitation and has nothing to do with your service scheduler.
Kafka consumer groups simply cannot scale beyond the partition count. So, if you have a single-partition topic because you care about strict event ordering, then only one replica of your service can be active and consuming from the topic, and you'd need to handle failover in specific ways that are outside the scope of Kafka itself.
If your concern is the k8s autoscaler, then you can look into the KEDA autoscaler for Kafka services.
Kafka, as OneCricketeer notes, bounds the parallelism of consumption by the number of partitions.
If you couple processing with consumption, this limits the number of instances which will be performing work at any given time to the number of partitions to be consumed. Because the Kafka consumer group protocol includes support for reassigning partitions consumed by a crashed (or non-responsive...) consumer to a different consumer in the group, running more instances of the service than there are partitions at least allows for the other instances to be hot spares for fast failover.
It's possible to decouple processing from consumption. The broad outline could be to have every instance of your service join the consumer group; at most as many instances as there are partitions will actually consume from the topic. Each consuming instance can then make a load-balanced network request to another (or the same) instance, based on the message it consumed, to do the processing. If you allow the consumer to have multiple requests in flight, this expands your scaling horizon to max-in-flight-requests * number-of-partitions.
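A rough sketch of that shape (not a complete implementation), assuming a hypothetical topic work-topic, a processing endpoint http://processing-service/process behind a load balancer, and a cap of 10 in-flight requests:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Semaphore;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DispatchingConsumer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("group.id", "dispatcher");                 // assumption
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        HttpClient http = HttpClient.newHttpClient();
        // Cap in-flight requests; effective parallelism becomes partitions * maxInFlight.
        Semaphore inFlight = new Semaphore(10);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-topic"));        // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    inFlight.acquire();
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("http://processing-service/process")) // behind a load balancer
                            .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                            .build();
                    http.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                        .whenComplete((resp, err) -> inFlight.release());
                }
                // Offset commits are deliberately omitted here; see the note on
                // delivery guarantees below.
            }
        }
    }
}
```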
If it happens that the messages in a partition don't need to be processed in order, simple round-robin load-balancing of the requests is sufficient.
Conversely, if there are effectively multiple logical streams of messages multiplexed into a given partition (e.g. if messages are keyed by equipment ID: the second message for ID A needs to be processed after the first message, but could be processed in any order relative to messages for ID B), you can still do this, but it needs some care around ensuring ordering. Additionally, given the amount of throughput you should be able to get from a consumer of a single partition, needing to scale out to the point where you have more processing instances than partitions suggests that you'll want to investigate load-balancing approaches where, if request B needs to be processed after request A (presumably because request A could affect the result of request B), A and B get routed to the same instance. That way they can leverage local in-memory state rather than do a read-from-db then write-to-db pas de deux.
This sort of architecture can be implemented in any language, though maintaining a reasonable level of availability and consistency is going to be difficult. There are frameworks and toolkits which can deliver a lot of this functionality: Akka (JVM), Akka.Net, and Protoactor all implement useful primitives in this area (disclaimer: I'm employed by Lightbend, which maintains and provides commercial support for one of those, though I'd have (and actually have) made the same recommendations prior to my employment there).
When consuming messages from Kafka in this style of architecture, you will definitely have to make the choice between at-most-once and at-least-once delivery guarantees and that will drive decisions around when you commit offsets. Note particularly that you need to be careful, if doing at-least-once, to not commit until every message up to that offset has been processed (or discarded), lest you end up with "at-least-zero-times", which isn't a useful guarantee. If doing at-least-once, you may also want to try for effectively-once: at-least-once with idempotent processing.
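A minimal sketch of the at-least-once variant with manual commits, assuming a hypothetical topic named events and a made-up group id:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        props.put("group.id", "at-least-once-demo");       // assumption
        props.put("enable.auto.commit", "false");          // we commit manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));          // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record);                        // must succeed (or be safely discarded)
                }
                // Commit only after everything in the batch is done; committing earlier
                // would risk the "at-least-zero-times" situation described above.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Idempotent processing here turns at-least-once into effectively-once.
    }
}
```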
I have three Kafka clusters: A, B, and C. I have data coming in on cluster A on topic incoming.dataA and on cluster B on topic incoming.dataB.
I need a way to send all messages received on incoming.dataA on Cluster A and incoming.dataB on Cluster B to a topic on Cluster C, received.data. Can this be done?
I am aware of mirroring and streaming, but neither of those helps when forwarding data from one Kafka cluster to another while their topic names differ.
MirrorMaker can only be used between two clusters, so you'd have to chain A->B->C.
Your next option would be to use an Apache project (or just a regular client app) such as Spark/Flink/Beam/NiFi/Camel to consume from each cluster with individually configured consumers and forward the records with a single producer client (it would be recommended to join the data first, somehow, if ordering or other characteristics matter).
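As an illustration, a bare-bones "regular client app" forwarder could look like the sketch below. The cluster addresses are assumptions, and error handling, ordering, and offset management are omitted; one instance would run per source cluster/topic, both writing to received.data on cluster C:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TopicForwarder {

    // Run one instance with ("cluster-a:9092", "incoming.dataA") and another with
    // ("cluster-b:9092", "incoming.dataB"); both forward to received.data on cluster C.
    public static void main(String[] args) {
        String sourceBootstrap = args[0];
        String sourceTopic = args[1];

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", sourceBootstrap);
        consumerProps.put("group.id", "forwarder-" + sourceTopic);
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "cluster-c:9092");   // assumption
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of(sourceTopic));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Forward with the same key so related messages stay together on cluster C.
                    producer.send(new ProducerRecord<>("received.data", record.key(), record.value()));
                }
            }
        }
    }
}
```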
=== Assume everything from the consumer's point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of microservice instances. For example: say I have 1 topic with 1 partition for my serviceA. The producer pushes messages to topic T1, partition P1, and on the consumer side (ServiceA1) I can read from t1, p1. If I spin up a new pod (ServiceA2) to get higher throughput, the second instance will never receive any messages, because Kafka/ZooKeeper assigns an id to each consumer and partition 1 is already taken by ServiceA1. So ServiceA2 (and any further instances) stays idle. To avoid this hassle, Kafka recommends adding more partitions, so that the number of consumers can be increased or decreased based on need.
I was also able to test this through the command line, and service2 never consumed any messages. If I shut down service1, then service2 was able to pick up new messages. So if I spin up more pods, fail-safety/availability increases, but throughput stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do you scale message-oriented systems themselves?
Every topic has at least one partition; by default it comes with only one partition if you don't specify a partition count. In your case, you have a consumer group that consists of two consumers, and each consumer reads the log from a partition. The first consumer reads the log from the first (and only) partition; for the second consumer there is no partition left to consume from, so it stays idle. Only once the first consumer goes down does the second consumer start reading data from the first partition, beginning at the last committed offset.
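A minimal illustration of that setup, assuming a hypothetical broker address and the single-partition topic t1 from the question; both ServiceA1 and ServiceA2 would run this same code:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ServiceAConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        // Both instances use the same group.id, so they form one consumer group.
        props.put("group.id", "serviceA");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // t1 has a single partition, so only one group member is assigned to it;
            // the other instance stays idle until a rebalance (e.g. the first one dies).
            consumer.subscribe(List.of("t1"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```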
Please check the blogs and videos below. They explain topics, consumers, and consumer groups in Kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea of how consumers and consumer groups work.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka, perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not be a constraint.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
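To make the hashing point concrete, the small sketch below mirrors what Kafka's default partitioner does for keyed records (murmur2 of the serialized key modulo the partition count), using the murmur2 helper that ships in kafka-clients; the key name is made up:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    // Mirrors the default partitioner's behavior for keyed records:
    // murmur2 hash of the serialized key, modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "order-42";  // hypothetical record key
        // The same key can land on a different partition once the partition count changes,
        // which is why simply adding partitions can break per-key ordering guarantees.
        System.out.println(partitionFor(key, 6));
        System.out.println(partitionFor(key, 12));
    }
}
```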
We just started replacing messaging with Kafka.
In a traditional MQ setup there will be a cluster with one or more queues inside.
The MQ cluster/coordinator service will deliver the messages to clients.
There can be 10 services/clients that consume messages from a single queue.
So if there are 10 messages in the queue, each service/consumer/client can read/process 1 message.
This is not possible in Kafka, which, as I now understand, is by design.
To achieve similar functionality in Kafka I have to add at least as many partitions as there are clients/consumers/pods.
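For example, creating the topic with a partition count at least equal to the planned number of pods could look like this sketch (broker address, topic name, and the partition/replication counts are all assumptions):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            // 10 partitions so up to 10 consumers/pods in one group can read in parallel;
            // replication factor 3 for fault tolerance (broker count permitting).
            NewTopic topic = new NewTopic("work-queue", 10, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```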
I have a use case where messages are coming from a channel, and we want to push them into a Kafka topic (with multiple partitions). In our case message order is important, so we have to push the messages to the topic in the order they are received, which looks very straightforward if we have only one producer and a single partition. However, for load balancing and scalability we want to run multiple instances of the same producer, and the problem is how to maintain the order of messages.
Any thoughts or solutions would be greatly appreciated.
Even if I decide to use a single partition, can it be replicated to multiple brokers for availability and fault tolerance?
we have to push the messages to the topic in the order they are received, which looks very straightforward if we have only one producer and a single partition
You can have multiple partitions in the topic with one producer and still have the order maintained if you provide a key for your messages. All messages with the same key produced by a single producer are always in order, because they all go to the same partition.
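A small sketch of that, with made-up topic and key names; the in-flight setting is there so broker-side retries cannot reorder messages within the partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Keep retries from reordering messages within a partition.
        props.put("max.in.flight.requests.per.connection", "1");
        // (Alternatively, enable.idempotence=true preserves order even with retries.)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records share the key "device-7", so they go to the same partition
            // and are stored in the order they were sent.
            producer.send(new ProducerRecord<>("readings", "device-7", "temp=20.1"));
            producer.send(new ProducerRecord<>("readings", "device-7", "temp=20.4"));
        }
    }
}
```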
When you say multiple producers, I assume that you have multiple instances of your application running and that you are not creating multiple producers in the same JVM instance.
Since you said channel, I suppose it is a network channel, like a datagram channel, for example. In that case, I suppose you are listening on some port and sending the received data to Kafka.
I do not see a point in having multiple producers in the same instance producing to the same topic, so it is better to have a single producer send all the messages; for performance you can tune the producer properties like batch.size, linger.ms, etc.
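For instance, the batching-related settings might be tuned like this (the values are illustrative, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumption
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("batch.size", "65536");       // allow larger batches (in bytes) per send
        props.put("linger.ms", "20");           // wait up to 20 ms to fill a batch
        props.put("compression.type", "lz4");   // optional: fewer bytes per record on the wire

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... the single shared producer sends all messages for the application ...
        }
    }
}
```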
To achieve fault tolerance, have another instance running in HA mode (fail-over mode), so that if this instance dies the other automatically picks up.
If it is a network channel, you can run multiple instances and open the socket with the SO_REUSEADDR option from StandardSocketOptions; this way only one producer will be active at any point, and a new producer will become active once the active one dies.
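A rough sketch of that suggestion, opening a DatagramChannel with SO_REUSEADDR before binding (the port is an assumption, and the single-active-instance behavior is as described in the answer above):

```java
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public class ReusableChannelListener {
    public static void main(String[] args) throws Exception {
        // Several instances can bind the same port when SO_REUSEADDR is set.
        DatagramChannel channel = DatagramChannel.open();
        channel.setOption(StandardSocketOptions.SO_REUSEADDR, true);
        channel.bind(new InetSocketAddress(9999));   // hypothetical port

        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        while (true) {
            buffer.clear();
            channel.receive(buffer);   // blocking receive of one datagram
            buffer.flip();
            // ... hand the datagram's payload to the (single) Kafka producer ...
        }
    }
}
```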
I have two applications: one is a regular Kafka consumer, and the other is a gRPC-based microservice. The Kafka consumer is only responsible for consuming messages; the business logic resides in the microservice. Also, the key for messages in our Kafka topic is null, so Kafka does round-robin assignment of messages to partitions, which distributes incoming messages evenly across all partitions. At the end of the day I am dealing with non-transactional storage (BigTable), so I have to make sure that only one thread is responsible for reading, updating, and writing a row key into the storage, in order to avoid race conditions. My gRPC microservice runs in a Kubernetes cluster on multiple pods. How can I make sure that a message belonging to a particular row key goes to the same pod in the Kubernetes cluster, so that there are no race conditions? My microservice is responsible for writing the final output to BigTable and sits behind a load balancer.
It might not be a solution if you already have a (big) code base, but streaming frameworks like Apache Flink handle this pretty gracefully.
It has an operator, keyBy(), that does exactly what you want: it partitions the messages by a key you define and guarantees that messages with the same key get processed by the same thread.
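A tiny sketch of what that looks like, using in-memory elements as a stand-in for the Kafka source and assuming the row key is the part of each message before the first comma:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source; each element is "rowKey,payload".
        DataStream<String> messages =
                env.fromElements("row1,update-a", "row2,update-b", "row1,update-c");

        // keyBy hashes the extracted row key, so every message for the same row key
        // is handled by the same parallel subtask; no two threads touch one row concurrently.
        messages
                .keyBy(value -> value.split(",", 2)[0])
                .map(value -> "processed " + value)   // the read-modify-write for that row would go here
                .print();

        env.execute("key-by-example");
    }
}
```

In the real setup the source would be the Kafka topic (e.g. via Flink's Kafka connector), but the keyBy/processing part stays the same.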