Ensure Fairness in Publisher/Subscriber Pattern - apache-kafka

How can I ensure fairness in the Pub/Sub pattern, e.g. in Kafka, when one publisher produces thousands of messages while all the other producers only produce a handful? It's not predictable which producer will have high activity.
It would be great if messages from other producers didn't have to wait for hours just because one producer is very active.
What are the patterns for that? Is it possible with Kafka or another technology like Google PubSub? If yes, how?
Multiple partitions also don't work very well in that case, or at least I can't see how they would.

In Kafka, you could utilise the concept of quotas to prevent certain clients from monopolising the cluster resources.
There are 2 types of quotas that can be enforced:
Network bandwidth quotas
Request rate quotas
More detailed information on how these can be configured can be found in the official documentation of Kafka.
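As an illustration, here is a minimal sketch of applying a produce quota to a single client id via the AdminClient; the client id noisy-producer, the bootstrap address, and the 1 MB/s limit are assumptions for the example:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

import java.util.Collections;
import java.util.Properties;

public class ProducerQuotaExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Target the (hypothetical) client id of the very active producer
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, "noisy-producer"));

            // Cap its produce throughput at ~1 MB/s (a network bandwidth quota)
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                    entity,
                    Collections.singletonList(
                            new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));

            admin.alterClientQuotas(Collections.singletonList(alteration)).all().get();
        }
    }
}
```

The same quota can also be set with the kafka-configs.sh tool instead of the AdminClient.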

Related

Kafka P2P Header based routing

I have a requirement to send an event to multiple systems based on their system code. Destination systems can grow in the future, and they should be able to subscribe only to the events they are interested in. It's a security mandate, so as a producer we need to ensure this.
We could use a RabbitMQ header exchange with multiple shovel configurations to different queues in different vhosts or clusters, but I am looking for a similar pattern with Kafka.
If we maintain different topics and authorise each consumer for its corresponding topic, this can grow in the future; as a producer I need to implement the topic routing logic, and the number of topics will grow.
The other option is to use AWS SNS and subscribe multiple SQS queues. Based on filter policies the message can be routed.
Could anyone think of a better solution to this problem?
send an event to multiple systems based on their system code
Using the Kafka Streams API, you can use branching to route data to different topics based on Predicate logic.
Once data is in their respective topics, "multiple systems" can consume them
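As a rough sketch of that branching idea, the topology below splits one input topic by system code using the Streams split()/Branched API; the topic names, system codes, and application settings are assumptions for the example:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Branched;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SystemCodeRouter {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "all-events" and the SYS_A/SYS_B codes are hypothetical
        KStream<String, String> events = builder.stream("all-events");

        events.split()
              .branch((key, value) -> value.contains("SYS_A"),
                      Branched.withConsumer(ks -> ks.to("system-a-events")))
              .branch((key, value) -> value.contains("SYS_B"),
                      Branched.withConsumer(ks -> ks.to("system-b-events")))
              .defaultBranch(Branched.withConsumer(ks -> ks.to("unrouted-events")));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "system-code-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Per-topic ACLs can then restrict each destination system to reading only its own topic.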

ActiveMQ Artemis topology for enterprise-wide messaging in clustering mode

We are trying to build a solution on ActiveMQ Artemis that offers high availability, where multiple producers can put messages and, at the same time, a huge set of consumers can come and pick them up. In theory we should be able to deal with millions of messages per hour.
I have checked the documentation and a few sites, but they do not say much about clustering; options like failover are mentioned, etc.
Could someone please point me to good documentation that talks more about this space and explains which topology is ideal for which case?

Kafka connect throttling

I have a requirement to consume messages on behalf of a set of lazy consumers who just expose REST APIs. Therefore, I am planning to have sink connectors that fetch messages from Kafka topics and perform HTTP POST operations against the exposed APIs.
One of the key factors for consideration is throttling. What mechanism do you suggest for throttling the sink tasks to meet the tier SLA of the APIs? I understand that Kafka has a client quota feature; however, what is the optimum mechanism to keep track of API requests per minute or second that would allow the client quota to be adjusted dynamically?
I think the best way to implement rate-limiting for your REST API would be in your connector code by blocking if necessary in SinkTask.put(). You may want to think about whether rate-limiting at the level of your SinkTasks is sufficient or you need it to be global (more complex since coordination involved).
The advantage of using Kafka quotas which you were considering is that the distributed aspect is handled for you, however I believe those can currently only be configured in terms of bytes transferred.
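A minimal sketch of blocking in SinkTask.put() could look like the following; it assumes Guava's RateLimiter is on the classpath, and the requests.per.second property, class name, and HTTP helper are hypothetical:

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.Collection;
import java.util.Map;

public class ThrottledHttpSinkTask extends SinkTask {

    private RateLimiter rateLimiter;

    @Override
    public void start(Map<String, String> props) {
        // "requests.per.second" is a hypothetical connector config property
        double permitsPerSecond =
                Double.parseDouble(props.getOrDefault("requests.per.second", "10"));
        rateLimiter = RateLimiter.create(permitsPerSecond);
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // Block until a permit is available; keep the total delay per put()
            // well below the consumer's max.poll.interval.ms
            rateLimiter.acquire();
            postToRestApi(record);
        }
    }

    private void postToRestApi(SinkRecord record) {
        // Hypothetical helper: issue an HTTP POST with the record value (omitted)
    }

    @Override
    public void stop() {
    }

    @Override
    public String version() {
        return "0.0.1";
    }
}
```

Because each task limits itself independently, a global limit across all tasks would still need the external coordination mentioned above.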

Why is Kafka pull-based instead of push-based?

Why is Kafka pull-based instead of push-based? I agree Kafka gives high throughput, as I have experienced it, but I don't see how Kafka's throughput would go down if it were push-based. Any ideas on how a push-based design can degrade performance?
Scalability was the major driving factor when designing such systems (pull vs push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add a large number of consumers without affecting performance and without downtime.
Kafka can handle events from producers at rates of 100k+ per second. Because Kafka consumers pull data from the topic, different consumers can consume the messages at different paces. Kafka also supports different consumption models: you can have one consumer processing the messages in real time and another consumer processing the messages in batch mode.
The other reason could be that Kafka was designed not only for single consumers like Hadoop; different consumers can have diverse needs and capabilities.
Pull-based systems have some deficiencies, such as wasting resources due to regular polling. Kafka supports a 'long polling' waiting mode until real data comes through to alleviate this drawback.
Refer to the Kafka documentation which details the particular design decision: Push vs pull
Major points that were in favor of pull are:
Pull is better in dealing with diversified consumers (without a broker determining the data transfer rate for all);
Consumers can more effectively control the rate of their individual consumption;
Easier and more optimal batch processing implementation.
The drawback of a pull-based systems (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' waiting mode until data arrives.
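To make the 'long poll' point concrete, here is a plain consumer sketch; the topic name, group id, and the exact fetch.min.bytes / fetch.max.wait.ms thresholds are assumptions, but with such settings the broker holds the fetch request until enough data accumulates (or the wait expires) rather than the client busy-polling:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LongPollConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
        props.put("group.id", "demo-group");                 // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Long-poll tuning: the broker answers the fetch once ~64 KB are
        // available or 500 ms have passed, whichever comes first.
        props.put("fetch.min.bytes", "65536");
        props.put("fetch.max.wait.ms", "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```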
Others have provided answers based on Kafka's documentation but sometimes product documentation should be taken with a grain of salt as an absolute technical reference. For example:
Numerous push-based messaging systems support consumption at different rates, usually through their session management primitives. You establish/resume an active application layer session when you want to consume, and suspend the session (e.g. by simply not responding for less than the keepalive window and greater than the in-flight windows, or with an explicit message) when you want to stop/pause. MQTT and AMQP, for example, both provide this capability (in MQTT's case, since the late 90's). Given that no actions are required to pause consumption (by definition), and less traffic is required under steady state (no requests), it is difficult to see how Kafka's pull-based model is more efficient.
One critical advantage push messaging has vs. pull messaging is that there is no request traffic to scale as the number of potentially active topics increases. If you have a million potentially active topics, you have to issue queries for all those topics. This concern becomes especially relevant at scale.
The critical advantage pull messaging has vs push messaging is replayability. This factors a great deal into whether downstream systems can offer guarantees around processing (e.g. they might fail before doing so and have to restart or e.g. fail to write messages recoverably).
Another critical advantage for pull messaging vs push messaging is buffer allocation. A consuming process can explicitly request as much data as they can accommodate in a pre-allocated buffer, rather than having to allocate buffers over and over again. This gains back some of the goodput losses vs push messaging from query scaling (but not much). The impact here is measurable, however, if your message sizes vary wildly (e.g. a few KB->a few hundred MB).
It is a fallacy to suggest that pull messaging has structural scalability advantages over push messaging. Partitioning is what is usually used to provide scale in messaging applications, regardless of the consumption model. There are push messaging systems operating well in excess of 300M msgs/sec on hard wired local clusters...125K msgs/sec doesn't even buy admission to the show. In fact, pull messaging has inferior goodput by definition and systems like Kafka usually end up with more hardware to reach the same performance level. The benefits noted above may often make it worth the cost. I am unaware of anyone using Kafka for messaging in high frequency trading, for example, where microseconds matter.
It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize the goodput. The results were never staggering and the system complexity and other factors often outweigh this kind of optimization. I believe this is Jay's point overall about practical performance over real data center networks, not to mention things like the open Internet.
Pushing is just extra work for the broker. With Kafka, the responsibility of fetching messages is on consumers. Consumers can decide at what rate they want to process the messages.
If a broker is pushing messages and some of the consumers are down, the broker will retry a certain number of times to push the messages until it decides not to push anymore. This decreases performance. Imagine the workload of pushing messages to multiple consumers.

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All the features that seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem given Kafka's architecture. The rate of messages probably won't be more than 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?
Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id to both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.
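If the cluster runs with an authorizer enabled, one partial mitigation (a sketch; the principal name and group-name prefix are hypothetical) is to give each external client access only to its own consumer-group namespace via a prefixed group ACL:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

import java.util.Collections;
import java.util.Properties;

public class GroupAclExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Allow the (hypothetical) principal User:client-a to use only
            // consumer groups whose names start with "client-a-".
            AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.GROUP, "client-a-", PatternType.PREFIXED),
                    new AccessControlEntry("User:client-a", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));

            admin.createAcls(Collections.singletonList(binding)).all().get();
        }
    }
}
```

With a deny-by-default authorizer, another tenant can then no longer join a group named under someone else's prefix, which avoids the accidental group.id collision described above.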