Messaging platform with QoS / Kafka partition overloading - apache-kafka

I'm having a recurrent issue with Kafka: I partition messages by customer id, and sometimes it happens that a customer gets a huge amount of messages. As a result, the messages of this customer and all other customers in the same partition get delayed.
Are there well-known ways to handle this issue? Possibly with other messaging platforms?
Ideally, only that customer's messages would be delayed. Other customers' messages would get an equal share of the consumers' bandwidth.
Note: I must partition by customer id, because I want to consume the messages of any given customer in order. However, I can consume the messages of two different customers in any order.

I will try and answer based on the limited information provided.
Kafka partitions are the smallest unit of scalability. For example, if you have 10 parallel consumers (Kafka topic listeners), you should partition your topic by this number or higher; otherwise some of your listeners will be starved, because Kafka assigns consumers such that only one consumer in a group reads from a given partition. This protects the message order within a partition. The other direction is supported: a consumer can handle more than one partition at a time.
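The one-consumer-per-partition rule can be illustrated with a toy assignment function (a simplification of what the Kafka group coordinator actually does; all names here are illustrative, not client API):

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin partitions over a consumer group's members.

    Each partition gets exactly one consumer; once there are more
    consumers than partitions, the extras receive nothing (starved).
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

With 4 partitions and 6 consumers, two consumers end up with an empty assignment, which is exactly the starvation described above.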
My design suggestion would be to first decide how much capacity you plan to allocate for the consumer (microservice) instances; that number will guide you to the right number of partitions.
I would avoid a dynamic number of partitions, as that does not scale well. Use a number that matches the capacity you plan to allocate, plus some spare in case you need to scale up in the future. Say tomorrow you have 5 new customers: adding partitions is neither easy nor wise.
Kafka guarantees message order per partition, so that comes for free in your use case. What you need is for the consumer side to handle the messages of different customer IDs in the right order. To avoid messages for the same customer getting out of order, your partition key should be a higher-level category of customers; I can think of customer type/region/size. The idea is that all of a single customer's messages stay in the same partition.
Your partition key should also relate to the size of messages/data, so that your messages spread evenly over your Kafka cluster. This helps with the Kafka cluster's own scaling and redundancy.
Deciding on the right partitioning strategy is hard, but it is worth the time spent planning it.
One design solution that comes up a lot is hashing: map the customer ID to a partition number with a hash. Again, decide on a fixed partition count and let the hash map the customer ID to your partition key.
Using X modulo partitions:
X customers have a lot of messages and need a partition each, so map one customer per partition; your modulo here is the number of those customers.
Y customers are low-traffic; for these, use a different modulo, e.g. Y/5, so that 5 customers share a partition.
Make sure you offset the Y partition numbers by the X partition count so they don't overlap.
The only issue I see is that this is not flexible: you cannot change the mapping if the number of customers changes. You might allow spare partitions in each group to support future growth.
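A sketch of the two-tier mapping described above, assuming the hot customer IDs are known up front (the function and variable names are made up for illustration; crc32 stands in for any stable hash):

```python
import zlib

def choose_partition(customer_id, hot_customers, shared_partitions):
    """Hot customers get a dedicated partition; the rest share a pool.

    Partitions 0 .. len(hot_customers)-1 are reserved, one per hot
    customer; low-traffic customers hash into the next
    shared_partitions partitions, so the two ranges never overlap.
    """
    num_hot = len(hot_customers)
    if customer_id in hot_customers:
        return hot_customers.index(customer_id)
    return num_hot + zlib.crc32(str(customer_id).encode()) % shared_partitions
```

The same customer always maps to the same partition, so per-customer ordering is preserved, and a hot customer can only delay its own dedicated partition.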

Related

Too much load on kafka partition (extra load on single partition)

Suppose there is a trade capture application. The application consumes messages via Kafka, and the partition key is the stock id (e.g. google, apple, tesla). This works fine on normal days. Now suppose there is bad news for company X and people are selling its stock. In that case, all the messages land on a single partition during that trading session or day. How do I handle this efficiently? Can we apply multiple consumers to the same partition?
It's due to an overloaded partition on a random day. We have more than a dozen partitions along with dozens of consumers, and the partitions/consumers are evenly loaded throughout the year. It's only when there is a sudden spike of data in a single partition, which happens once a month or once a quarter.
Can we apply multiple consumers to same partition
Not in the same consumer group, no.
The only way to reasonably handle this is to increase max.poll.records and other consumer properties to consume faster from that partition, and/or all partitions. Unfortunately, you won't know ahead of time which partition will get "overloaded".
The other alternative is to redesign your topic(s) such that "stock tickers" are not your partition key, and whatever you do choose as your partitioning strategy is not driven by end-user behavior that is out of your control (or otherwise define your own Partitioner class).
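One way such a custom partitioning strategy is often implemented is key "salting": known hot keys are fanned out over several sub-keys, trading per-ticker total order for throughput (ordering then only holds per salted sub-key). A minimal sketch under those assumptions; the names are illustrative, not a Kafka Partitioner API:

```python
import zlib

def salted_partition(key, num_partitions, hot_keys, salt_buckets=4, seq=0):
    """Map a record key to a partition, spreading hot keys out.

    Normal keys hash as usual; a hot key gets a rotating salt suffix
    so a burst for one ticker fans out over up to salt_buckets
    partitions instead of flooding one.
    """
    if key in hot_keys:
        key = f"{key}-{seq % salt_buckets}"
    return zlib.crc32(key.encode()) % num_partitions
```

The consumer side must then tolerate interleaving between the salted sub-streams of a hot ticker, which is the price of the extra parallelism.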

Should I create more topics or more partitions?

Kafka receives orders from different countries.
I need to group these orders by country. Should I create one topic per country, or one topic with multiple partitions?
Another option would be to have one topic and use Kafka Streams to filter orders and send them to a per-country topic.
What is better if the number of countries is over 180?
I want to distribute orders across executors located in a specific country/city.
Remark:
So, an order has data about the country/city. Kafka must then find the executors in that country/city and send them the same order.
tl;dr
In your case, I would create one topic countries and use the country_id or country_name as the message key, so that messages for the same country are placed in the same partition. This way, each partition will contain the data for a specific country (or several countries, depending on the partition count).
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second holds companies. Also, mixing entities in a single topic would prevent you from implementing, say, per-user message ordering, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine while a topic can be distributed across the whole Kafka Cluster by partitioning it across multiple partitions. Kafka Docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow
the log to scale beyond a size that will fit on a single server. Each
individual partition must fit on the servers that host it, but a topic
may have many partitions so it can handle an arbitrary amount of data.
Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
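The keyed-message behaviour underpinning the tl;dr can be sketched as follows. Kafka's default partitioner actually uses murmur2 over the key bytes; crc32 below is only a stand-in to show the determinism:

```python
import zlib

def partition_for_key(key, num_partitions):
    """Deterministic key -> partition mapping (illustrative stand-in
    for Kafka's murmur2-based default partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Because the mapping is a pure function of the key, every order keyed by "DE" lands in the same partition, which is what preserves per-country ordering.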

Can Kafka scale to tens of millions of topics?

I'm currently planning the development of a Device Server and am keen to use Kafka, however, I'm unsure if it's capable of supporting a paradigm where there is one topic per device, when there could be 10 million+ devices.
I would expect only one partition per topic and limited required storage (<1MB) per topic. If it makes any difference one topic with millions of partitions could also be considered.
Is anyone able to provide clarification of the scaling limits and expectations of Kafka at this level? In particular, I'm keen to understand the overheads per topic and the effectiveness/feasibility of an individual consumer consuming from ~10k subscribed topics over a single connection.
Any advice much appreciated,
Many thanks
Kafka best practice would be to use keys rather than topics for that many devices: Kafka scales to an effectively unlimited number of keys, but not to an unlimited number of topics.
Having one topic with many partitions has some advantages. First of all, you can use keys, as already said, to specify which device is sending the message. The number of partitions doesn't need to equal the number of devices; it can be far lower. Thanks to the key usage, the main point is that messages from the same device (same key) always go to the same partition, in order. On the consumer side, you can leverage multiple consumers in the same consumer group working on different partitions and sharing the message load; you can scale up to as many consumers as there are partitions.

Merging ordered Kafka topics into a single ordered topic

I have N topics as input, each with messages added in ascending delivery date order. Topics can vary widely in message count, date range, partitioning strategy. But I know that all partitions for every topic will independently be in date order.
I want to merge all N topics priority-queue style into a new single topic T. T also has whatever partition count and strategy it wants, since the only requirement is that each individual partition of T is still in date order on its own. I then feed T to partition-aware consumers, which will consume the messages and idle between due dates, since I want each message to be delivered on, or shortly after, its delivery date. This whole pipeline can stream forever.
I expect tuning issues with exactly how partitions amongst all the N input topics and the single T output topic are distributed, and advice which affects that specifically is welcome but right now I'm mainly interested in the overall viability of doing this at all using only Kafka topics, not a RDB or Key-value store. So some extra I/O moving messages between non-optimal topic partitions is okay.
Is this doable with the 0.9 consumer where I can control knowing which partitions are assigned to each consumer, so I can let auto-rebalancing occur while endlessly peek/merge-to-T/commit-offset the oldest message on each actual partition? I must have partition awareness to have a chance of this working.
Due to needing shared merge state (the last date added to T), is it better to stick with multiple partition-aware consumers in a single process, parallel processes or multiple servers given where that state will need to be? I favor keeping the state onboard in shared memory not networked in ZK or whatever. On a restart I can get it once and maintain it while running if on a single machine.
Am I overlooking any Kafka features that would make what I describe easier or more efficient, like some atomic message move between topics? I know I am going against the grain of its design and this scenario is similar to TS.
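For what it's worth, the merge step itself is just a k-way heap merge over the per-partition streams; in real use each source would wrap a partition-aware consumer, but the core logic can be sketched with the standard library (all names illustrative):

```python
import heapq

def merge_by_date(*sources):
    """k-way merge of independently date-ordered message streams.

    Each source yields (delivery_date, message) tuples in ascending
    date order; the output is globally ascending, suitable for
    producing into a date-ordered partition of T.
    """
    yield from heapq.merge(*sources, key=lambda m: m[0])
```

heapq.merge is lazy, so it only ever holds one head element per source, which matters when the pipeline streams forever.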

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, message ordering can only be achieved within a partition (or with a single-partition topic). In that case, what parallelism benefit are we getting? Isn't it equivalent to a traditional MQ?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id, and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that the message with user_id 1 goes to partition 1, the message with user_id 2 goes to partition 2, and so on.
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
That is, if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
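A toy model of that routing makes the guarantee concrete: order survives within each partition even though nothing holds across partitions (not real client code; names are illustrative):

```python
def route(messages, num_partitions):
    """Assign (user_id, payload) records to partitions by user_id.

    Append order within each partition mirrors Kafka's per-partition
    ordering guarantee; there is no ordering across partitions.
    """
    partitions = [[] for _ in range(num_partitions)]
    for user_id, payload in messages:
        partitions[user_id % num_partitions].append(payload)
    return partitions
```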
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use 1 partition and lose all the parallelism ability.
But in the second case, you might consider partitioning your messages by some key; all messages for that key will then arrive at one partition (they might actually move to another partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages ordered across multiple partitions, you really need to group your messages by a key, so that messages with the same key go to the same partition and, within that partition, the messages are ordered.
In a nutshell, you need to design a two-level solution like the above to get messages ordered across multiple partitions.
You may consider having a field which holds the timestamp/date from the time the record is created at the source.
Once the data is consumed, you can load it into a database. The data then needs to be sorted at the database level before using the dataset for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is the timestamp generated when the data is created, and the value is the actual message string.
As and when a message is picked up by the consumer, it is written into HBase with the RowKey as the Kafka key and the value as the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data in order. You can then serve the data from HBase to the downstream apps.
This way you are not losing the parallelism of Kafka, and you also have the option of sorting and performing additional processing logic on the data at the database level.
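The idea doesn't require HBase specifically; any map served in sorted key order restores the global timestamp order after parallel consumption. An in-memory sketch (a plain dict plus sorted(), standing in for HBase's sorted RowKeys):

```python
store = {}

def write(ts, message):
    """Persist a consumed record keyed by its creation timestamp."""
    store[ts] = message

def read_ordered():
    """Serve messages in global timestamp order, whatever the arrival order."""
    return [store[ts] for ts in sorted(store)]
```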
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to rethink and use another broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, via more partitions and more consumers.
Traditional MQs work in a way such that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue, and usually allows some level of transaction when pulling a message off, to ensure that the desired action was executed before the message is removed.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, so I decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then that same strict ordering should be maintained while consuming them; there is no point in ordering messages in the queue but not on consumption. Kafka allows the best of both worlds: it preserves message order within a partition from production through consumption, while allowing parallelism between partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (parallelism and strict total ordering don't go together).
Relative ordering only: go for multiple partitions and consumers, and use consistent hashing to ensure all messages that need to follow a relative order go to a single partition.
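A minimal consistent-hash ring, as mentioned in the last point, might look like the sketch below (in Kafka this logic would live in a custom Partitioner; the default partitioner is plain hash-mod, and every name here is illustrative):

```python
import bisect
import zlib

class HashRing:
    """Minimal consistent-hash ring mapping record keys to partitions."""

    def __init__(self, partitions, vnodes=50):
        # each partition gets vnodes points on the ring for a smoother spread
        self._ring = sorted(
            (zlib.crc32(f"{p}:{v}".encode()), p)
            for p in partitions
            for v in range(vnodes)
        )

    def partition_for(self, key):
        point = zlib.crc32(str(key).encode())
        # first ring point clockwise from the key's hash, wrapping to 0
        idx = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[idx][1]
```

The benefit over plain hash-mod is that adding a partition only remaps the keys adjacent to its ring points, instead of reshuffling almost every key.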