I am a beginner with Apache Storm and wondering when the order of tuples is guaranteed in a stream.
If I understand the post "Processing records in order in Storm" correctly, the order between a Bolt/Spout and another Bolt is guaranteed.
Say I have a KafkaSpout that emits tuples ordered by timestamp, followed by some Bolts with a fields grouping on some id:
builder.setBolt("Bolt1", bolt1).fieldsGrouping("Bolt1", new Fields("id"));
Is it guaranteed that tuples with id x are always processed in order by a Bolt? That is, must Tuple1 be processed in Bolt1 strictly before Tuple2 is processed in Bolt1 if they have the same id? By "strictly" I mean not in parallel.
Is this true even when a worker node fails?
That depends on your topology and on where "Bolt1" sits relative to the KafkaSpout. For example, consider the following two topologies.
Case 1 -
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("Bolt1", bolt1).fieldsGrouping("KafkaSpout", new Fields("id"));
In this case bolt1 comes directly after the KafkaSpout, and with a fields grouping all tuples with the same "id" go to the same bolt instance, so they are processed strictly in order.
However, consider the following topology:
Case 2 -
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("Bolt2", bolt2).shuffleGrouping("KafkaSpout");
builder.setBolt("Bolt1", bolt1).fieldsGrouping("Bolt2", new Fields("id")); //id field emitted by Bolt2
In this case the order can be lost in Bolt2, because a shuffle grouping spreads the tuples across Bolt2's parallel tasks. There is therefore no guarantee that tuples reach Bolt1 in the order they were written to the Kafka partition.
In general, if you need strict processing order in Storm, it is your responsibility to make every component process and emit in order. That usually means restricting parallelism in your code and topology, so you cannot use the full capabilities of Storm.
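The per-key guarantee in Case 1 comes from routing alone: a fields grouping pins each "id" to one task, so a single ordered upstream is split into per-task substreams that are each still in order. The following is a minimal plain-Java sketch of that idea, with no Storm dependency. The class name FieldsGroupingSketch, the hash-modulo rule and the task count are illustrative assumptions, not Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics how a fields grouping pins each key to one task. */
final class FieldsGroupingSketch {

    /** Picks a task index for a key; the same key always maps to the same index. */
    static int taskFor(String id, int numTasks) {
        // Assumed routing rule for this sketch: hash of the key modulo task count.
        return Math.floorMod(id.hashCode(), numTasks);
    }

    /**
     * Splits one ordered stream of ids into per-task lists of stream positions.
     * Each list is filled in arrival order, so every task sees the positions
     * of its own keys in increasing order.
     */
    static List<List<Integer>> route(List<String> orderedIds, int numTasks) {
        List<List<Integer>> perTask = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            perTask.add(new ArrayList<>());
        }
        for (int pos = 0; pos < orderedIds.size(); pos++) {
            perTask.get(taskFor(orderedIds.get(pos), numTasks)).add(pos);
        }
        return perTask;
    }
}
```

Calling FieldsGroupingSketch.route(List.of("x", "y", "x", "z", "x"), 3) keeps all positions of "x" in one list and in their original order. In the Case 2 layout the input to this routing step is no longer a single ordered stream, which is where the guarantee breaks.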
Related
As far as I understand, both the Kafka producer and consumer have to use a single thread per topic-partition if we want to write and read records in order. Am I right, or can they use multiple threads in such situations?
Ordering can be achieved in Kafka in both single-threaded and multithreaded environments.
Single broker / single partition -> single-threaded consumer model
Message ordering in Kafka works well for a single partition, but with a single partition, parallelism and load balancing are difficult to achieve. Note that in this case only one thread accesses the topic partition, so ordering is always guaranteed.
Multiple brokers / multiple partitions -> multithreaded consumer model (consumer groups with more than one consumer)
In this case we assume the topic has multiple partitions and each partition is handled by a single consumer (more precisely, a single thread) in each consumer group. This is the sense in which the model is multithreaded.
There are three methods by which a producer can distribute messages across partitions, and they differ in how much ordering they retain. Each method has its own pros and cons.
Method 1: Round Robin or Spraying
Method 2 : Hashing Key Partition
Method 3 : Custom Partitioner
Round Robin or Spraying (Default)
In this method, the partitioner sends messages to the partitions in round-robin fashion, which balances the load so that no partition is overloaded. This achieves parallelism and load balancing, but it does not maintain overall order; only the order within each partition is maintained. This is the default for messages sent without a key, and it is not suitable for some business scenarios.
In order to overcome the above scenarios and to maintain message ordering, let’s try another approach.
Hashing Key Partition
In this method we create a ProducerRecord with a message key for each message sent to the topic, so that ordering is kept per key within a partition.
The default partitioner uses the hash of the key to ensure that all messages with the same key go to the same partition. This is the easiest and most common approach. It is the same method used for Hive bucketing. The partition is the hash modulo the number of partitions:
Hash(Key) % Number of partitions -> Partition number
The key therefore determines the partition: the producer always sends messages with a given key to the same partition, which maintains their order. The drawback is that the hash does not take load into account, so if a few keys carry most of the traffic, their partitions become overloaded.
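The formula above can be tried out in a few lines of plain Java. This is only a sketch: String.hashCode() stands in for the hash function as an assumption, whereas Kafka's built-in partitioner, as far as I know, hashes the serialized key bytes with murmur2. The class name KeyHashPartitionSketch is made up for illustration.

```java
/** Illustrative only: key-based partition choice, Hash(Key) % Number of partitions. */
final class KeyHashPartitionSketch {

    /**
     * Returns a partition in the range [0, numPartitions).
     * floorMod keeps the result non-negative even when the hash is negative,
     * which a plain % would not.
     */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the function is deterministic, KeyHashPartitionSketch.partitionFor("user-42", 4) returns the same partition on every call, which is what keeps all messages for that key in one partition. It also shows why changing the number of partitions can move a key to a different partition.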
Custom Partitioner
We can write our own business logic to decide which message is sent to which partition. With this approach, we can order messages according to our business logic and achieve parallelism at the same time.
For more details, see:
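As a rough illustration of such business logic, the sketch below reserves partition 0 for keys with a "vip-" prefix and hashes all other keys over the remaining partitions. The rule, the prefix and the class name BusinessRulePartitionSketch are hypothetical. In a real producer this decision would live in a class implementing Kafka's org.apache.kafka.clients.producer.Partitioner interface, registered through the partitioner.class producer setting; the sketch leaves that wiring out so it runs without the Kafka client library.

```java
/** Illustrative only: a hypothetical business rule for choosing a partition. */
final class BusinessRulePartitionSketch {

    /**
     * Assumed rule: keys starting with "vip-" get partition 0 to themselves;
     * every other key is hashed over partitions 1..numPartitions-1.
     * With fewer than two partitions everything goes to partition 0.
     */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions < 2) {
            return 0;
        }
        if (key.startsWith("vip-")) {
            return 0;
        }
        return 1 + Math.floorMod(key.hashCode(), numPartitions - 1);
    }
}
```

Any rule works as long as it is deterministic per key: the same key must always land in the same partition, otherwise per-key ordering is lost again.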
https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22
Please also note that this information describes partition-level parallelism.
There is also a newer strategy called consumer-level parallelism. I have not read up on it, but you can find details at the Confluent link below:
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
Let's say I have two producers (ProducerA and ProducerB) writing to the same topic with a single partition. Each producer writes its own unique events serially. So if ProducerA fired 3 events and then ProducerB fired 3 events, my understanding is that Kafka cannot guarantee the order across the producers' events like this:
ProducerA_event_1
ProducerA_event_2
ProducerA_event_3
ProducerB_event_1
ProducerB_event_2
ProducerB_event_3
due to acking, retrying, etc.
However, will each individual producer's events still be in order? For example:
ProducerA_event_1
ProducerB_event_2
ProducerB_event_1
ProducerA_event_2
ProducerA_event_3
ProducerB_event_3
This is of course a simplified version of what I am doing, but I just want to guarantee that if I am reading from a topic for a specific producer's events, then those events will be in order even if other producers' events interleave with them.
The short answer is yes: each individual producer's events are guaranteed to be in order.
Messages in Kafka are appended to a topic partition in the order they are sent, and consumers read the messages in the same order they are stored in the topic partition.
So if you are interested only in the messages from Producer A and filter out everything else, then in the given scenario you can expect events 1, 2 and 3 from Producer A to be read in that order.
PS: I am however curious to understand the motivation behind using just one partition. Also, on your statement:
So if ProducerA fired 3 events and then ProducerB fired 3 events, my
understanding is that Kafka cannot guarantee the order across the
producer's events like this:
You are correct in saying that the overall ordering is something that cannot be guaranteed but ordering within a partition can be guaranteed.
I hope this helps.
There is a nice article on Medium which states that Kafka does not always guarantee message ordering even for the same producer. It all depends on the Kafka configuration. In particular, max.in.flight.requests.per.connection has to be set to 1. The reason is that if there are multiple requests (say, 2) in flight and the first one fails and is retried, the second is appended to the log before the retried first one, which breaks the ordering.
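The setting mentioned above can be expressed as producer configuration. The sketch below only builds the java.util.Properties object, so it runs without the Kafka client library; the class and method names are made up, and a real KafkaProducer would additionally need key.serializer and value.serializer entries. As far as I know, newer clients can also keep ordering with more than one in-flight request when enable.idempotence is set to true, but that is not what this sketch configures.

```java
import java.util.Properties;

/** Illustrative only: producer settings aimed at strict per-partition ordering. */
final class OrderedProducerConfigSketch {

    /** Builds the ordering-related settings; the broker address is supplied by the caller. */
    static Properties strictOrderProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas so an acknowledged write is not lost.
        props.put("acks", "all");
        // Only one unacknowledged request at a time, so a retried batch
        // cannot be overtaken by a later one.
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }
}
```

The trade-off is throughput: with a single in-flight request the producer waits for each acknowledgement before sending the next batch on that connection.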
A producer's messages will be stored, per partition, in the order they are received. If you can guarantee message ordering on the producer, then consumers can assume ordering when polling. Retry logic, multiple KafkaProducer instances, and other asynchronous implementation details might complicate ordered message production. Often these can be mitigated by including a unique event identifier, an identifier of the producer, and a timestamp of sufficient granularity either in the key or value of the message. Relying on ordering in an asynchronous framework is often a best case flow but there should be some way to compensate when things come in out of order.
I'm new to Storm and I'm having trouble figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users which have fulfilled the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled I need to process records in order. The thing is, due to parallelization (bolt replication) I don't get a user's check-ins in order. Because of that, the patterns won't work.
How can I overcome this problem? How can I process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, like Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it yourself in your bolt code.
The only support Storm offers is through Trident. You can put the tuples of a certain time period (for example, one minute) into a single batch and so process all tuples within that minute at once. However, this only works if your use case allows it, because you cannot relate tuples from different batches to each other. In your case, that would require points in time at which all users have reached their destination and no other user has started a new interaction; i.e., points in time at which no two users overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system solution, i.e., one based on custom user code, there are two approaches:
You could, for example, buffer up tuples and sort them by timestamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks that guarantee that no tuple with a smaller timestamp than the punctuation arrives after it. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
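To make the second approach more concrete, here is a minimal plain-Java sketch of merging per-substream buffers by timestamp. It is not the linked TimestampMerger code. The class name TimestampMergeSketch is made up, tuples are reduced to bare long timestamps, and punctuations are left out, so the merge simply stops as soon as any buffer is empty.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative only: merges already-ordered per-substream buffers by timestamp. */
final class TimestampMergeSketch {

    /**
     * Repeatedly emits the smallest head timestamp across all buffers.
     * Stops as soon as any buffer is empty: that substream might still deliver
     * a smaller timestamp, so emitting further would risk breaking the order.
     */
    static List<Long> drain(List<Deque<Long>> buffers) {
        List<Long> out = new ArrayList<>();
        while (true) {
            Deque<Long> smallest = null;
            for (Deque<Long> buffer : buffers) {
                if (buffer.isEmpty()) {
                    return out; // must wait for more input on this substream
                }
                if (smallest == null || buffer.peekFirst() < smallest.peekFirst()) {
                    smallest = buffer;
                }
            }
            if (smallest == null) {
                return out; // no substreams at all
            }
            out.add(smallest.pollFirst());
        }
    }
}
```

The early return is the blocking behaviour described above: one idle substream holds back everything buffered for the others, which is why punctuations may be needed in practice.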
Storm supports this use case. You just have to ensure that order is maintained throughout your flow in all the involved components. As a first step, in the Kafka producer, all the messages for a particular user id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at the spout instance. This ensures that all the messages for the same user id arrive at the same spout.
Now comes the tricky part. To maintain order in the bolt, you want to use a fields grouping on the "user_id" field emitted from the Kafka spout. The provided KafkaSpout does not split the message into fields, so you would have to override it to read the message and emit a "user_id" field from the spout. Another way of doing this is to have an intermediate bolt which reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally specify a bolt with a fields grouping on "user_id", all messages with a particular user_id value go to the same instance of the bolt, whatever the degree of parallelism of the bolt.
A sample topology which would work for your case could be as follows:
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("FieldsEmitterBolt", FieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", CalculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); //user_id field emitted by FieldsEmitterBolt
Beware: if you have a limited number of user_ids, all the user_id values could end up at the same CalculatorBolt instance, which would in turn decrease the effective parallelism. Also note that FieldsEmitterBolt has to run as a single task here, because a shuffle grouping over several tasks can reorder tuples before they reach CalculatorBolt.
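Once each user's check-ins arrive in order at one bolt instance, the pattern check itself needs only a small amount of per-user state. The sketch below is plain Java with no Storm dependency; in a topology this logic would sit inside the bolt that receives the fields-grouped stream. The class name PathMatcherSketch is made up, and it assumes a non-empty path whose locations must be visited in consecutive check-ins, which may be stricter than what the question intends.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: detects users whose consecutive check-ins follow a given path. */
final class PathMatcherSketch {

    private final List<String> path;                                // e.g. A, B, C (assumed non-empty)
    private final Map<String, Integer> progress = new HashMap<>();  // user -> matched prefix length

    PathMatcherSketch(List<String> path) {
        this.path = path;
    }

    /** Feeds one check-in; returns true when it completes the path for that user. */
    boolean onCheckIn(String userId, String locationId) {
        int next = progress.getOrDefault(userId, 0);
        if (locationId.equals(path.get(next))) {
            next++;                       // the expected next location
        } else if (locationId.equals(path.get(0))) {
            next = 1;                     // restart the path from its first location
        } else {
            next = 0;                     // any other location resets the progress
        }
        if (next == path.size()) {
            progress.remove(userId);
            return true;
        }
        progress.put(userId, next);
        return false;
    }
}
```

The matcher is only correct if check-ins for a user are fed in time order: the sequence B, A, C for one user never matches the path A, B, C, which is exactly the failure described in the question.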
I have a requirement where I consume delta updates of items from Kafka queue partitions.
The producer makes sure that all delta updates for a given item are in the same partition and in order.
For example, if I1 and I2 are items, the updates I1-Up1 and I1-Up2 will be in the same partition. Similarly, I2-Up1 and I2-Up2 will be in the same partition.
Hence, when my first set of bolts receives them, the tuples should be fields-grouped by item (I1 and I2). I mean, the same bolt should receive all updates of I1 in order.
The problem is how to specify this in the bolt right after the spout, since the KafkaSpout needs to do something for it.
A quick search suggested I can write my own Kafka scheme (implementing MultiScheme), but I found no code examples.
I've written one but I'm not sure how to make sure it does a fields grouping.
According to the Apache Kafka documentation, message order is guaranteed only within a partition, or in a topic with a single partition. In that case, what parallelism benefit do we get? Isn't it equivalent to traditional MQs?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that the message with user_id 1 goes to partition 1, the message with user_id 2 goes to partition 2, and so on.
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case, as soon as the 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer would handle 2 partitions and the consuming throughput would be roughly half.
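The relationship between partitions and consumers in this example can be sketched in plain Java. The round-robin rule below is a simplification chosen for illustration and is not Kafka's actual assignment strategy; the class name PartitionAssignmentSketch is made up. What it does share with Kafka is the constraint that, within one consumer group, each partition is assigned to exactly one consumer.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: spreads partitions over the consumers of one group. */
final class PartitionAssignmentSketch {

    /**
     * Returns, for each consumer index, the list of partitions it owns.
     * Every partition goes to exactly one consumer (assumed round-robin rule).
     */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            owned.get(p % numConsumers).add(p);
        }
        return owned;
    }
}
```

With 4 partitions, assign(4, 4) gives every consumer one partition and assign(4, 2) gives each consumer two. With assign(4, 5) the fifth consumer gets nothing, which illustrates why parallelism is capped by the number of partitions.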
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
That is, if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use 1 partition and give up all parallelism.
But in the second case, you might consider partitioning your messages by some key. All messages for that key will then arrive at one partition (they might actually go to another partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key from the same producer are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written, so data read from a partition is read in order for that partition.
So if you want your messages in order while using multiple partitions, you need to group your messages by a key, so that messages with the same key go to the same partition and are ordered within that partition.
In a nutshell, you need to design a two-level solution like the one above to get messages ordered across multiple partitions.
You may consider having a field which holds the timestamp/date at which the data was created at the source.
Once the data is consumed, you can load it into a database. The data needs to be sorted at the database level before the dataset is used for any use case. This is an attempt to help you think about the problem in more than one way.
Let's say the message key is the timestamp generated when the data was created and the value is the actual message string.
As each message is picked up by the consumer, it is written into HBase with the Kafka key as the RowKey and the Kafka value as the value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data. You can then serve the data from HBase to the downstream apps.
In this way you are not losing the parallelism of Kafka. You also get to sort the data and run other processing logic on it at the database level.
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to consider another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism through more partitions or more consumers.
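The sorted-map behaviour can be imitated with a java.util.TreeMap, which is enough to see why the order in which consumers write no longer matters. This sketch does not use the HBase client API, and the class name SortedStoreSketch is made up. It assumes non-negative millisecond timestamps and zero-pads them to a fixed width, because HBase compares row keys lexicographically as bytes, so unpadded numbers would not sort numerically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/** Illustrative only: a TreeMap standing in for a store sorted by row key. */
final class SortedStoreSketch {

    private final TreeMap<String, String> rows = new TreeMap<>();

    /** Fixed-width key so that lexicographic order equals numeric order. */
    static String rowKey(long epochMillis) {
        return String.format(Locale.ROOT, "%019d", epochMillis);
    }

    /** Writes may arrive in any order, e.g. from consumers of different partitions. */
    void put(long epochMillis, String value) {
        rows.put(rowKey(epochMillis), value);
    }

    /** Reads the values back in row-key order, i.e. by timestamp. */
    List<String> scan() {
        return new ArrayList<>(rows.values());
    }
}
```

One caveat of using the bare timestamp as the key: two messages with the same timestamp overwrite each other, so a real row key would probably need a tiebreaker such as a producer id or sequence number appended to it.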
Traditional MQ works in a way such that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed, before the message gets removed, but once a message has been processed, it gets removed from the queue.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
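The difference from a traditional queue can be sketched with an append-only list and one read position per consumer group. This is a toy model in plain Java, not the Kafka client API; the class and method names (LogVsQueueSketch, publish, poll, seekToBeginning) are chosen for illustration and only loosely mirror Kafka's terminology.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a persisted log read independently by several consumer groups. */
final class LogVsQueueSketch {

    private final List<String> log = new ArrayList<>();             // messages are never removed
    private final Map<String, Integer> offsets = new HashMap<>();   // group -> next position to read

    void publish(String message) {
        log.add(message);
    }

    /** Returns everything the group has not read yet and advances only that group's offset. */
    List<String> poll(String group) {
        int from = offsets.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        offsets.put(group, log.size());
        return batch;
    }

    /** Rewinds one group so it can replay the log from the start. */
    void seekToBeginning(String group) {
        offsets.put(group, 0);
    }
}
```

Two groups polling the same log each receive every message, and a group can rewind and read them again; in the queue model described above, the first successful consumer would have removed them.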
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, hence decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering has to be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue but not while consuming them. Kafka gives you the best of both worlds: it keeps messages ordered within a partition from production through consumption, while allowing parallelism across multiple partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (again, parallelism and strict ordering don't go together).
Relative ordering within groups of messages: use multiple partitions and consumers, and use consistent hashing to ensure that all messages which need to keep their relative order go to a single partition.