I'm new to Storm and I'm having trouble figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users who have followed the path I specified (for example, users that went from location A to location B to location C).
I'm using a Kafka producer and reading these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled, I need to process records in order. The thing is, due to parallelization (bolt replication) I don't get a user's check-ins in order. Because of that, the patterns won't work.
How can I overcome this problem? How can I process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, such as Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it yourself in your bolt code.
The only support Storm delivers is via Trident. You can put the tuples of a certain time period (for example, one minute) into a single batch and thus process all tuples within that minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, it would only work if you know that there are points in time at which all users have reached their destination (and no other user has started a new interaction); i.e., you need points in time at which no two users overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system solution, i.e., one based on custom user code, there are two approaches:
You could, for example, buffer up tuples and sort them by timestamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks which ensure that no tuple with a smaller timestamp than the punctuation arrives after that punctuation. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
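A minimal plain-Java sketch of this buffer-and-sort idea, independent of the Storm API. The class name PunctuationSorter and its methods are hypothetical, tuples are reduced to their timestamps, and a punctuation p is assumed to promise that its substream will not deliver any later tuple with a timestamp at or below p:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper (not part of Storm): buffers timestamps and releases them in order. */
class PunctuationSorter {
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();
    private final long[] lastPunctuation; // latest punctuation seen per input substream

    PunctuationSorter(int substreams) {
        lastPunctuation = new long[substreams];
        Arrays.fill(lastPunctuation, Long.MIN_VALUE);
    }

    /** Buffer a tuple (represented here only by its timestamp). */
    void onTuple(long timestamp) {
        buffer.add(timestamp);
    }

    /** Record a punctuation and return, sorted, every buffered timestamp that is now safe to process. */
    List<Long> onPunctuation(int substream, long punctuation) {
        lastPunctuation[substream] = Math.max(lastPunctuation[substream], punctuation);
        // Only timestamps covered by the punctuations of ALL substreams are safe.
        long safe = Arrays.stream(lastPunctuation).min().getAsLong();
        List<Long> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek() <= safe) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```

In a real bolt the buffer would hold whole tuples, and the punctuations would have to be emitted by the upstream operators.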
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits its tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
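To illustrate the merge idea, here is a small plain-Java sketch. It is not the linked TimestampMerger code; SubstreamMerger is a hypothetical name, tuples are reduced to timestamps, and each substream is assumed to deliver its timestamps in ascending order. The merger emits the smallest head element only while every buffer is non-empty, which is exactly where the blocking mentioned above comes from:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: merges per-substream ordered buffers without sorting. */
class SubstreamMerger {
    private final List<Deque<Long>> buffers = new ArrayList<>();

    SubstreamMerger(int substreams) {
        for (int i = 0; i < substreams; i++) {
            buffers.add(new ArrayDeque<>());
        }
    }

    /** Append a timestamp to its substream buffer and return whatever can now be emitted in order. */
    List<Long> onTuple(int substream, long timestamp) {
        buffers.get(substream).addLast(timestamp);
        List<Long> out = new ArrayList<>();
        // While every substream has a head element, the smallest head is globally next.
        while (allNonEmpty()) {
            Deque<Long> smallest = buffers.get(0);
            for (Deque<Long> b : buffers) {
                if (b.peekFirst() < smallest.peekFirst()) {
                    smallest = b;
                }
            }
            out.add(smallest.pollFirst());
        }
        return out;
    }

    private boolean allNonEmpty() {
        for (Deque<Long> b : buffers) {
            if (b.isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

Punctuations are left out of this sketch; they would let the merger advance past a substream that is temporarily silent.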
Storm supports this use case. You just have to ensure that order is maintained throughout your flow, in all the involved components. As a first step, in the Kafka producer, all the messages for a particular user id should go to the same Kafka partition. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at the spout instance. This ensures that all the messages for the same user id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you need to use fields grouping on the bolt, based on the "user_id" field emitted from the Kafka spout. The provided KafkaSpout does not split the message into fields, so you would have to override the KafkaSpout to read the message and emit a "user_id" field from the spout. An alternative is to have an intermediate bolt which reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally specify a bolt with fields grouping on "user_id", all messages with a particular user_id value go to the same instance of the bolt, whatever the bolt's degree of parallelism.
A sample topology which would work for your case could be as follows:
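The core of such a partitioner is a deterministic mapping from user id to partition number. A plain-Java sketch of that mapping only (UserPartitioning and partitionFor are hypothetical names, and the hash is String.hashCode for simplicity, not the hash Kafka itself uses for keyed messages):

```java
/** Hypothetical helper: deterministic user_id -> partition mapping. */
class UserPartitioning {
    /** The same userId always maps to the same partition in [0, numPartitions). */
    static int partitionFor(String userId, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(userId.hashCode(), numPartitions);
    }
}
```

Inside a real Partitioner implementation this calculation would be applied to the user id extracted from the record.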
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("FieldsEmitterBolt", FieldsEmitterBolt).shuffleGrouping("KafkaSpout");
builder.setBolt("CalculatorBolt", CalculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id")); //user_id field emitted by FieldsEmitterBolt
Beware: if you have a limited number of user_ids, there could be cases where all the user_id values end up at the same CalculatorBolt instance. This in turn would decrease the effective parallelism.
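Once all check-ins of a user reach one bolt instance in order, the pattern check itself only needs a little per-user state. A plain-Java sketch of what the CalculatorBolt logic could look like (PathMatcher is a hypothetical class, the path is assumed to be non-empty, and check-ins at other locations in between are assumed to be ignored rather than resetting the match):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-user path matcher; expects each user's check-ins in time order. */
class PathMatcher {
    private final List<String> path;
    // user_id -> number of path steps matched so far
    private final Map<String, Integer> progress = new HashMap<>();

    PathMatcher(List<String> path) {
        this.path = path;
    }

    /** Feed one check-in; returns true when this check-in completes the path for the user. */
    boolean onCheckIn(String userId, String locationId) {
        int matched = progress.getOrDefault(userId, 0);
        if (path.get(matched).equals(locationId)) {
            matched++;
        }
        if (matched == path.size()) {
            progress.remove(userId); // start over for this user
            return true;
        }
        progress.put(userId, matched);
        return false;
    }
}
```

With the path A, B, C, the check-ins A, B, C for one user report a match on C, whereas the same check-ins arriving as B, A, C do not, which is why the ordering matters here.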
Related
I would like to implement in Apache Flink the following scenario:
Given a Kafka topic having 4 partitions, I would like to process the intra-partition data independently in Flink using different logics, depending on the event's type.
In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event has a different structure: partition 1 has the field "a" as key, partition 2 has the field "b" as key, etc. In Flink I would like to apply different business logic depending on the event, so I thought I should split the stream in some way. To achieve what's described in the picture, I thought of doing something like this, using just one consumer (I don't see why I should use more):
FlinkKafkaConsumer<..> consumer = ...
DataStream<..> stream = flinkEnv.addSource(consumer);
stream.keyBy("a").map(new AEventMapper()).addSink(...);
stream.keyBy("b").map(new BEventMapper()).addSink(...);
stream.keyBy("c").map(new CEventMapper()).addSink(...);
stream.keyBy("d").map(new DEventMapper()).addSink(...);
(a) Is this correct? Also, I would like to process each Flink partition in parallel, since I'm only interested in processing in order the events that come from the same Kafka partition, not in ordering them globally. (b) How can I do that? I know the method setParallelism() exists, but I don't know where to apply it in this scenario.
I'm looking for an answer about questions marked (a) and (b). Thank you in advance.
If you can build it like this, it will perform better:
Specifically, what I'm proposing is
Set the parallelism of the entire job to exactly match the number of Kafka partitions. Then each FlinkKafkaConsumer instance will read from exactly one partition.
If possible, avoid using keyBy, and avoid changing the parallelism. Then the source, map, and sink will all be chained together (this is called operator chaining), and no serialization/deserialization and no networking will be needed (within Flink). Not only will this perform well, but you can also take advantage of fine-grained recovery (streaming jobs that are embarrassingly parallel can recover one failed task without interrupting the others).
You can write a general purpose EventMapper that checks to see what type of event is being processed, and then does whatever is appropriate. Or you can try to be clever and implement a RichMapFunction that in its open() figures out which partition is being handled, and loads the appropriate mapper.
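The general-purpose mapper from the last point can be sketched in plain Java without the Flink API. DispatchingEventMapper is a hypothetical class, and events are assumed to be reduced to a type string plus a string payload; in Flink the same dispatch would sit inside a MapFunction:

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical mapper that picks the business logic based on the event type. */
class DispatchingEventMapper {
    private final Map<String, Function<String, String>> handlers;

    DispatchingEventMapper(Map<String, Function<String, String>> handlers) {
        this.handlers = handlers;
    }

    /** Look up the handler registered for the event type and apply it to the payload. */
    String map(String eventType, String payload) {
        Function<String, String> handler = handlers.get(eventType);
        if (handler == null) {
            throw new IllegalArgumentException("Unknown event type: " + eventType);
        }
        return handler.apply(payload);
    }
}
```

Because the dispatch happens inside one operator, no keyBy is needed and the source, map and sink can stay chained.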
Hi everybody I have a question about TimestampExtractor and Kafka Streams....
In our application it is possible to receive out-of-order events, so I would like to order the events by a business date inside the payload instead of by the point in time at which they were placed in the topic.
For this purpose I programmed a custom TimestampExtractor to pull the timestamp from the payload. Everything up to this point worked perfectly, but when I built the KTable on this topic, I noticed that the event I received out of order (from a business point of view it is not the last event, but it was received last) was displayed as the last state of the object, even though the ConsumerRecord had the timestamp from the payload.
I don't know; maybe it was my mistake to assume that Kafka Streams would fix this out-of-order problem via the TimestampExtractor.
Then, during debugging, I saw that if the TimestampExtractor returns -1, Kafka Streams ignores the message, and that the TimestampExtractor is also handed the timestamp of the last accepted event. So I built logic that performs the following check: if (payloadTimestamp < previousTimestamp) return -1. This achieves the behavior I want, but I am not sure whether I am sailing in dangerous waters.
Am I allowed to deal with a logic like this or what other ways exist to deal with out-of-order events in Kafka streams....
Thx for answers..
Currently (Kafka 2.0), KTables don't consider timestamps when they are updated, because the assumption is that there is no out-of-order data in the input topic. The reason for this assumption is the "single writer principle": it's assumed that for a compacted KTable input topic there is only one producer per key, and thus there won't be any out-of-order data with regard to single keys.
It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6521
For your fix: it's not 100% correct or safe to do this "hack":
First, assume you have two different messages with two different keys: <key1, value1, 5>, <key2, value2, 3>. The second record, with timestamp 3, arrives later than the first record, with timestamp 5. However, they have different keys, and thus you actually want to put the second record into the KTable. Only if you have two records with the same key do you want to drop late-arriving data, IMHO.
Second, if you have two records with the same key, the second one is out-of-order, and you crash before processing the second one, the TimestampExtractor loses the timestamp of the first record. Thus, on restart, it would not discard the out-of-order record.
To get this right, you will need to filter "manually" in your application logic instead of in the stateless and key-agnostic TimestampExtractor. Instead of reading the data via builder#table(), you can read it as a stream and apply a .groupByKey().reduce() to build the KTable. In your Reducer logic, you compare the timestamps of the new and old record and return the record with the larger timestamp.
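The reducer logic can be sketched in plain Java, outside the Kafka Streams API. LatestByTimestamp and Event are hypothetical names, and the business timestamp is assumed to be carried inside the value so that the reducer can compare it; buildTable only emulates a per-key reduce over records in arrival order:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "keep the record with the larger timestamp" per key. */
class LatestByTimestamp {
    /** Hypothetical record shape: key, value and the business timestamp from the payload. */
    static final class Event {
        final String key;
        final String value;
        final long timestamp;

        Event(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Reducer logic: keep whichever record has the larger timestamp (ties go to the newer arrival). */
    static Event keepLatest(Event current, Event incoming) {
        return incoming.timestamp >= current.timestamp ? incoming : current;
    }

    /** Applies the reducer per key, in arrival order. */
    static Map<String, Event> buildTable(List<Event> arrivals) {
        Map<String, Event> table = new HashMap<>();
        for (Event e : arrivals) {
            table.merge(e.key, e, LatestByTimestamp::keepLatest);
        }
        return table;
    }
}
```

Unlike the TimestampExtractor hack, the comparison is per key, so a record for key2 with timestamp 3 is still accepted after a record for key1 with timestamp 5.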
I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's sort of a random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream and publish each message to another topic, partitioning the record by its user id (repartitioning the original stream by user id).
Then we consume this repartitioned stream and store the consumed records in a local state store, windowed by 10 minutes. All clicks of a particular user are always in the same partition, but order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time is advanced when new records come in, but I need this window to use processing time, not the event time, and then when window is expired, I need to be able to sort buffered events, and emit them in that order to another topic.
Notice:
We need to be able to flush/process records within the window using processing time, not the event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to come for a particular user.
I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what the right way to implement what I need would be (more importantly, taking into account the recovery requirement described above).
You can see my answer to a similar question here:
https://stackoverflow.com/a/44345374/7897191
Since your message keys are already unique you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0 you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
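As a rough plain-Java illustration of the buffer, sort and flush cycle driven by wall-clock time (this is not the algorithm from the linked answer, and it does not use the Processor API; ProcessingTimeWindowBuffer is a hypothetical class, records are reduced to their event timestamps, and the current time is passed in explicitly):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical buffer that is flushed, sorted, once a wall-clock window has elapsed. */
class ProcessingTimeWindowBuffer {
    private final long windowMillis;
    private final List<Long> buffer = new ArrayList<>();
    private long windowStart; // wall-clock time at which the first buffered record arrived

    ProcessingTimeWindowBuffer(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Buffer a record's event timestamp; the window opens with the first record. */
    void add(long eventTimestamp, long nowMillis) {
        if (buffer.isEmpty()) {
            windowStart = nowMillis;
        }
        buffer.add(eventTimestamp);
    }

    /** Called periodically by a wall-clock timer; returns the sorted window once it has expired. */
    List<Long> onTimer(long nowMillis) {
        if (buffer.isEmpty() || nowMillis - windowStart < windowMillis) {
            return Collections.emptyList();
        }
        List<Long> out = new ArrayList<>(buffer);
        Collections.sort(out);
        buffer.clear(); // remove everything as soon as the window is flushed
        return out;
    }
}
```

This in-memory list does not survive a crash. To meet the recovery requirement, the buffer would have to live in a fault-tolerant state store, with the timer callback scanning that store after a restart.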
I am a beginner with Apache Storm and wondering when the order of tuples is guaranteed in a stream.
If I understand the post "Processing records in order in Storm" correctly, the order between a bolt/spout and another bolt is guaranteed.
So suppose I have a KafkaSpout which emits tuples ordered by timestamp, and some bolts with fields grouping on some id:
builder.setBolt("Bolt1", bolt1).fieldsGrouping("KafkaSpout", new Fields("id"));
Is it guaranteed that tuples with an id x are always processed in order by a bolt? That is, must Tuple1 be processed in Bolt1 (strictly) before Tuple2 is processed in Bolt1 if they have the same id? By "strictly" I mean not in parallel.
Is this true even when a worker node fails?
That depends on your topology and on where "Bolt1" lies in the topology relative to the KafkaSpout. For example, consider the following two topology cases:
Case 1 -
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("Bolt1", bolt1).fieldsGrouping("KafkaSpout", new Fields("id"));
In this case, since bolt1 comes directly after the KafkaSpout in the topology and uses fields grouping, all tuples with the same "id" will go to the same bolt instance and will be processed strictly in order.
However consider the following topology
Case 2 -
builder.setSpout("KafkaSpout", Kafkaspout);
builder.setBolt("Bolt2", bolt2).shuffleGrouping("KafkaSpout");
builder.setBolt("Bolt1", bolt1).fieldsGrouping("Bolt2", new Fields("id")); //id field emitted by Bolt2
In this case, since the order is lost in Bolt2, there is no guarantee that the tuples will arrive at Bolt1 in the order in which they were pushed into the Kafka partition.
In general, if you are looking for strict ordering of processing in Storm, it is your responsibility to make all the components process and emit in order. However, this generally restricts you from using the full capabilities of Storm, because it limits parallelism in your code and topology.
We have multiple input topics with different business events (page views, clicks, scroll events etc). As far as I understood Kafka streams they all get an event timestamp, which can be used for KStream joins with other streams or tables to align the times.
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
This should be possible by using groupByKey and then aggregate/reduce (specifying the inactivity time here) on a stream containing all events. This combined stream must have all events from the different input topics in event-time order (or in a way that lets the above Kafka Streams methods honor the event times).
The only challenge that is left, is to create this combined / merged stream.
When I look at the Kafka Streams API, there is the KStreamBuilder#merge operation, for which the javadoc says: "There is no ordering guarantee for records from different {@link KStream}s." Does this mean the session windowing will produce incorrect results?
If yes, what is the alternative to #merge?
I was also thinking about joining, but in fact it seems to depend on whether you have one event per topic per ID, or potentially multiple events with the same ID within one input topic. In the first case, joining is a good strategy, but not in the latter, as you would get some unnecessary duplication.
stream A: <a,1> <a,2>
stream B: <a,3>
join-output plus session: <a,1-3 + 2-3>
Number 3 would be a duplicate.
Also keep in mind that joining slightly modifies the timestamps, and thus your session windows might differ depending on whether you apply them to the join result or to the raw data.
About merge() and ordering: you can use merge() safely, as the session windows will be built based on record timestamps and not on offset order. And all window operations in Kafka Streams can handle out-of-order data gracefully.
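To see why arrival order does not matter when sessions are derived from timestamps, here is a small plain-Java sketch. It computes sessions in one batch rather than incrementally as Kafka Streams does; SessionSketch is a hypothetical class, and treating a distance exactly equal to the inactivity gap as "same session" is an assumption of this sketch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical batch computation of session boundaries from record timestamps. */
class SessionSketch {
    /** Returns [start, end] pairs; timestamps at most inactivityGap apart share a session. */
    static List<List<Long>> sessions(List<Long> arrivals, long inactivityGap) {
        // Only the timestamps matter, so the arrival order is discarded up front.
        List<Long> sorted = new ArrayList<>(arrivals);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        boolean open = false;
        long start = 0;
        long end = 0;
        for (long ts : sorted) {
            if (open && ts - end <= inactivityGap) {
                end = ts; // extend the current session
            } else {
                if (open) {
                    result.add(List.of(start, end));
                }
                start = ts;
                end = ts;
                open = true;
            }
        }
        if (open) {
            result.add(List.of(start, end));
        }
        return result;
    }
}
```

Feeding the same timestamps in a different arrival order, as can happen after merge(), yields the same list of sessions.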
What we want to do is: Merge all different events (originating from the above mentioned different topics) for a user id (i.e. group by user id) and apply a session window to them.
From what I understand, you'd need to join the streams (and use groupBy to ensure that they can be properly joined by user id), not merge them. You can then follow up with a session-windowed aggregation.