Storm max spout pending - real-time

This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files but I do not have a choice for this problem).
I set the topology.max.spout.pending to 50k to throttle how many tuples get into the topology to be processed. However, I see this number not having any effect in the topology. I see all records in a file being emitted every time. My guess is this might be due to a loop I have in the nextTuple() method that emits all records in a file.
My question is: Does Storm just stop calling nextTuple() for the Spout task when topology.max.spout.pending is reached? Does this mean I should only emit one tuple every time the method is called?

Exactly! Storm can only limit your spout with the next command, so if you transmit everything when you receive the first next, there is no way for Storm to throttle your spout.
The Storm developers recommend emitting a single tuple with a single next command. The Storm framework will then throttle your spout as needed to meet the "max spout pending" requirement. If you're emitting a high number of tuples, you can batch your emits to at most a tenth of your max spout pending, to give Storm the chance to throttle.

Storm topologies have a max spout pending parameter. The max
spout pending value for a topology can be configured via the
“topology.max.spout.pending” setting in the topology
configuration yaml file. This value puts a limit on how many
tuples can be in flight, i.e. have not yet been acked or failed, in a
Storm topology at any point of time. The need for this parameter
comes from the fact that Storm uses ZeroMQ to dispatch
tuples from one task to another task. If the consumer side of
ZeroMQ is unable to keep up with the tuple rate, then the
ZeroMQ queue starts to build up. Eventually tuples timeout at the
spout and get replayed to the topology thus adding more pressure
on the queues. To avoid this pathological failure case, Storm
allows the user to put a limit on the number of tuples that are in
flight in the topology. This limit takes effect on a per spout task
basis and not on a topology level.(source) For cases when the spouts are
unreliable, i.e. they don’t emit a message id in their tuples, this
value has no effect.
One of the problems that Storm users continually face is in
coming up with the right value for this max spout pending
parameter. A very small value can easily starve the topology and
a sufficiently large value can overload the topology with a huge
number of tuples to the extent of causing failures and replays.
Users have to go through several iterations of topology
deployments with different max spout pending values to find the
value that works best for them.

One solution is to build the input queue outside the nextTuple method and the only thing to do in nextTuple is to poll the queue and emit. If you are processing multiple files, your nextTuple method should also check if the result of polling the queue is null, and if yes, atomically reset the source file that is populating your queue.

Related

Delayed packet consumptions in kafka

is it possible to pick the packets by consumers after defined time in the packet by kafka consumer or how can we achieve this in kafka?
Found related question, but it didn't help. As I see: Kafka is based on sequential reads from file system and can be used only to read topics straightforward keeping message ordering. Am I right?
same is possible with rabbitMQ.
If I understand the question, you would need to consume the data, deserialize it and inspect the time field. Then append to some priority queue data structure and start a background timer thread to check if events from this queue should further be processed, and not block the Kafka consumer.
The only downside to this approach is that you then need to worry about processing and committing "shorter time" events that are read by the consumer while waiting for previously consumed "longer time". Otherwise, a restart of your client will drop all events from an in memory queue and start consuming after the last committed record.
You might be able to workaround this using a persistent "outbox pattern" database table, or otherwise tracking offsets and processed records manually, and seeking past any duplicates

Kafka Streams: event-time skew when processing messages from different partitions

Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. Kafka Streams application does some transformations on these messages, then groups by some key, and then aggregates messages by an event-time window with the given grace period.
Each task could process incoming messages at a different speed (e.g., because running on servers with different performance characteristics). This means that after groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.
Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark on partitions merge.
Should it be a real concern, especially when processing large amounts of historical data, or do I miss something? Does Kafka Streams offer a solution to deal with this scenario?
UPDATE My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle.
Consider this code snippet:
stream
.mapValues(...)
.groupBy(...)
.windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
.aggregate(...)
I assume mapValues() operation could be slow for some tasks for whatever reason, and because of that tasks do process messages at a different pace. When a shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).
My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.
Basically, a task/processor maintains a stream-time which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purpose in Kafka Streams (e.g: Punctator, Windowded Aggregation, etc).
[Windowed Aggregation]
As you mentioned, the stream-time is used to determine if a record should be accepted, i.e record_accepted = end_window_time(current record) + grace_period > observed stream_time.
As you described it, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline) this will create out-of-order messages. Unfortunately, I'm afraid that the only way to deal with that is to increase the grace_period.
This is actually the eternal trade-off between Availability and Consistency.
[Behaviour for KafkaStream and KafkaStream/KTable Join
When you are perfoming a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned Topics. For example the Task 0 will be assigned to Topic1-Partition0 and TopicB-Partition0.
The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.
Then, records are polled one by one from queues and processed by the topology instance. But, this is the record from the non-empty queue having the lowest timestamp which is returned from the polled.
In addition, if a queue is empty, the task may become idle during a period of time so that no more records are polled from queue. You can actually configure the maximum amount of time a Task will stay idle can be defined with the stream config :max.task.idle.ms
This mecanism allows synchronizing co-localized partitions. Bu, default the max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition which may lead to records being filtered because the stream-time will potentially increase more quickly.

Maximum spout capacity

I'm using Heron for performing streaming analytics on IoT data. Currently in the architecture there is only one spout with parallelism factor 1.
I'm trying to benchmark the stats on the amount of data Heron can hold in the queue which it internally uses at spout.
I'm playing around with the method setMaxSpoutPending() by passing value to it. I want to know if there is any limit on the number which we pass to this method?
Can we tweak the parameter method by increasing system configuration or providing more resource to the topology?
So if you have one spout and one bolt, then max spout pending is the best way to control the number of pending tuples.
Max Spout pending can be increased indefinitely. However increasing it beyond a certain amount increases the probability of timeout errors happening and in the worst case there could be no forward progress. Also higher msp typically require more heap required for spout and other components of the topology.
MSP is used to control the topology ingestion rate; it tells Storm the maximum number of tuples that may be unacknowledged at any given time. If the MSP is lower than the parallelism of the topology, it can be a bottle neck. On the other hand, increasing MSP beyond the topology parallelism level can lead to the topology being 'flooded' and unable to keep up with the inbound tuples. In such a situation the 'message timeout' of the topology will be exceeded and Storm will attempt to replay them while still feeding new tuples. Storm will stop feeding new inbound tuples only when the MSP limit is reached.
So yes, you can tweak it but keep an eye out for increasing timed out tuples indicating that your topology is overwhelmed.
BTW, if you're processing IoT events you may be able to increase parallelism by grouping the spout tuples by the device id (tuple stream per device) using field grouping.

storm failing to process all the tuples

I'm using Apache Storm to process huge data coming off a Kafka spout. Currently, there are over 3k json messages already published to Kafka and it's continuing. I have to process all the messages published from beginning. So, I have set a Kafka spout parameter accordingly.
This results in a lot of failures in tuple processing. I got this info from the storm UI.
I suspect the storm is not able to handle all the messages bombarded towards it in a single shot.
Any help is appreciated.
1) increase the parallelism hint for the bolts so that there's no backlog slowing down the processing for any tuple emitted by the spout, or
2) use the topology.max.spout.pending property to limit the number of tuples the spout can emit before having to wait for one of those tuples to complete.
try combination of both solutions. In production usually you need to run many iterations to get proper value of both the values (parallelism,topology.max.spout.pending)

Storm Ack method

Currently I have a simple topology. It has a spout to read data in, a bolt to transform the data and a bolt to store the data in a data store. I have setup anchors in order to get reply on any tuples that fail. Due to the way I read data in, there is only one spout but multiple transformer and store bolts. Now, what I have noticed is when the number of tuples in the cluster gets large the topology runs very slow. When I removed the acking everything went much faster. So I tried to increase the number of ackers but I noticed that there was still only one acker thread tracker all the tuples. Is it always just one acker thread per spout to track the tuples?