I'm using Apache Storm to process a large volume of data coming off a Kafka spout. Currently, over 3k JSON messages have already been published to Kafka, and more keep arriving. I have to process all messages published from the beginning, so I have set the corresponding Kafka spout parameter.
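Roughly, the spout is configured like this (a sketch assuming the storm-kafka-client API; the broker address and topic name are placeholders):

```java
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.kafka.spout.KafkaSpoutConfig.FirstPollOffsetStrategy;

// Sketch: consume the topic from the beginning (placeholder broker/topic)
KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
        .builder("localhost:9092", "my-topic")
        // EARLIEST makes the spout start from the first offset when there is
        // no previously committed offset for this consumer group
        .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.EARLIEST)
        .build();

KafkaSpout<String, String> spout = new KafkaSpout<>(spoutConfig);
```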
This results in a lot of failures in tuple processing; I can see this in the Storm UI.
I suspect Storm is not able to handle all the messages bombarded at it in a single shot.
Any help is appreciated.
Two options:
1) Increase the parallelism hint for the bolts so that no backlog slows down the processing of tuples emitted by the spout, or
2) use the topology.max.spout.pending property to limit the number of tuples the spout can emit before having to wait for one of those tuples to complete.
Try a combination of both. In production you usually need to run many iterations to find the right values for both settings (parallelism and topology.max.spout.pending).
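A sketch of both knobs together, assuming the standard TopologyBuilder/Config API (component names, parallelism hints, and the pending value are illustrative; ProcessingBolt is a hypothetical bolt class):

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout, 1); // kafkaSpout built elsewhere
// Raise the bolt parallelism hint so tuples don't back up behind one executor
builder.setBolt("process-bolt", new ProcessingBolt(), 4) // hypothetical bolt
       .shuffleGrouping("kafka-spout");

Config conf = new Config();
// Cap the number of unacked tuples in flight per spout task
conf.setMaxSpoutPending(1000); // example starting point, not a recommendation
```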
Related
We have a requirement to consume from a Kafka topic. The topic is provided by the producer team, and we have no control over it. The producer publishes a huge volume of messages, which our consumer is unable to keep up with. However, we only require 5-10% of the volume produced. Currently, the consumer deserializes each message and, based on certain attributes, drops 90-95% of them. The consumer is behind by 5-10 lakh (0.5-1 million) messages most of the time during the day. We even tried 5 consumers with 30 threads each, but without much success.
Is there any way we can subscribe the consumer to the topic with some filter criteria, so that we only receive the messages we are interested in?
Any help or guidance would be highly appreciated.
It is not possible to filter messages without consuming, and at least partially deserializing, all of them.
Broker-side filtering is not supported, though it has been discussed for a long time (https://issues.apache.org/jira/browse/KAFKA-6020).
You mentioned that you do not control the producer. However, if you can get the producer team to put the attribute you filter on into a message header, you can save yourself the parsing of the message body. You still need to read every message, but parsing can be CPU intensive, so skipping it helps with the lag.
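As a sketch of what that could look like with the plain Kafka consumer API: the producer is assumed to set a hypothetical header named "type", and the value is kept as raw bytes so dropped messages are never parsed:

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class HeaderFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "filtering-consumer");      // placeholder
        // Keep keys/values as raw bytes; only deserialize messages we keep
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("the-topic")); // placeholder
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // "type" is a hypothetical header the producer would have to set
                    Header type = record.headers().lastHeader("type");
                    if (type == null || !"interesting"
                            .equals(new String(type.value(), StandardCharsets.UTF_8))) {
                        continue; // drop without ever parsing the message body
                    }
                    process(record.value()); // deserialize only the 5-10% we need
                }
            }
        }
    }

    private static void process(byte[] body) { /* application-specific */ }
}
```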
We are trying to benchmark the performance of our Storm topology. We are ingesting around 1,000 messages/second into a Kafka topic. When we set max.spout.pending=2000 on our KafkaSpout, we don't see any failed messages in the Storm UI, but when we decrease the max.spout.pending value to 500 or 100, we see a lot of failed messages on the spout. My understanding was that keeping max.spout.pending low should avoid failed messages, as nothing would time out, but it behaves in the opposite manner. We are using Storm 1.1.0 from HDP 2.6.5.
We have one Kafka spout and two bolts:
KafkaSpout Parallelism - 1
Processing Bolt Parallelism - 1
Custom Kafka Writer Bolt Parallelism - 1
Does anyone have any idea about this?
The first thing to do is check the latency statistics in the Storm UI. You should also look at how loaded the bolts/spouts are (the capacity statistic).
Is the rate at which tuples are emitted much higher than the rate at which they are sunk? That is the indication I get from your observation that increasing max spout pending fixes the issue. Can you provide those stats? Another part worth exploring is increasing the tuple timeout, to see whether timeouts are causing replays that flood the topology.
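If you try the timeout experiment, the relevant setting is topology.message.timeout.secs; a sketch (60 is only an example value):

```java
import org.apache.storm.Config;

Config conf = new Config();
// Give tuples more time before they are considered failed and replayed
conf.setMessageTimeoutSecs(60); // default is 30s; 60 is just an example
```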
Please find the topology stats below: [topology stats screenshot]
This is interesting. You are right; follow these steps to narrow down the issue:
Upload a screenshot of your Topology Visualization screen under peak load.
Check for bolts whose color changes to brown/red. Red indicates that a bolt takes too long to process records.
Your spout/bolt executors are far too few to process 1K tuples per second (see the sketch after this list).
How many machines are you using?
If tuples fail in the KafkaSpout, it most often means a timeout error.
Find out after how many processed events the tuples start failing.
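A sketch of what scaling the executors for your three components could look like (the hints and worker count are illustrative starting points; the bolt classes are stand-ins for yours):

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout, 2);                // was 1
builder.setBolt("processing-bolt", new ProcessingBolt(), 4)    // was 1; hypothetical class
       .shuffleGrouping("kafka-spout");
builder.setBolt("kafka-writer-bolt", new KafkaWriterBolt(), 4) // was 1; hypothetical class
       .shuffleGrouping("processing-bolt");

Config conf = new Config();
conf.setNumWorkers(2); // spread the executors over more worker JVMs
```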
I am new to Storm and have a few basic questions. My use case for Storm is both stream and batch processing.
Use case #1: the Storm topology takes in the tuples as a stream and processes them.
Use case #2: the Storm topology should take in the tuples as a batch of tuples and process them.
I'm using Kafka as the queuing mechanism to feed the Storm topology.
Question: Is there a way to mark a particular tuple as the end of the stream, so that Storm can tell me when the processing of all the tuples has finished?
Is Storm not the correct framework for this, since it is meant for stream processing (use case #1)? Would Storm Trident help with use case #2?
You cannot tell Storm that a tuple is the last of a stream. However, if you know that you just emitted the last tuple from your spout, you can set an internal flag and then wait until you have received all acks in the spout. Once all acks are in, you know that every tuple was processed completely by Storm.
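A sketch of that flag-and-counter pattern, assuming a BaseRichSpout; readNextRecord() is a stand-in for your actual data source:

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class FiniteSourceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private boolean allEmitted = false; // internal "last tuple sent" flag
    private boolean reported = false;
    private long pending = 0;           // emitted but not yet acked/failed
    private long nextId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // nextTuple/ack/fail all run on the same thread, so no locking is needed
        if (!allEmitted) {
            String record = readNextRecord(); // hypothetical source; null = exhausted
            if (record == null) {
                allEmitted = true;
            } else {
                collector.emit(new Values(record), nextId++); // anchored with a message id
                pending++;
            }
        } else if (pending == 0 && !reported) {
            reported = true;
            // Every emitted tuple has been acked: the stream is fully processed
            System.out.println("All tuples fully processed by Storm");
        }
    }

    @Override
    public void ack(Object msgId) { pending--; }

    @Override
    public void fail(Object msgId) {
        pending--;
        // A real spout would re-emit the failed tuple here
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }

    private String readNextRecord() { return null; } // stub
}
```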
For question 2, it is not clear to me what you mean by "do the same processing". It seems you want to process the same data twice in different modes (or did I misunderstand something)? Why do you distinguish between the "stream" and "batch" cases? What different semantics do you want to get? And what do you mean by "take in the tuples as a batch of tuples"? Do you know that you have a finite data stream? Do you want to put all tuples into a single batch? Or do you want to do some micro-batching?
For micro-batching, Trident would be useful. If you have a real batch job, Storm is not a good fit; for that, you might want to check out Apache Flink (disclaimer: I am a committer to Flink) or Apache Spark. Both are hybrid systems supporting batch and streaming. Depending on your needs regarding streaming semantics, one or the other might be the better fit: Spark emulates streaming via micro-batching, while Flink does real streaming.
Currently I have a simple topology. It has a spout to read data in, a bolt to transform the data, and a bolt to store the data in a data store. I have set up anchors in order to get a replay of any tuples that fail. Due to the way I read data in, there is only one spout but multiple transformer and store bolts. Now, what I have noticed is that when the number of tuples in the cluster gets large, the topology runs very slowly. When I removed the acking, everything went much faster. So I tried to increase the number of ackers, but I noticed that there was still only one acker thread tracking all the tuples. Is it always just one acker thread per spout to track the tuples?
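For reference, I increased the ackers roughly like this (a sketch using the standard Config API; 4 is an arbitrary example):

```java
import org.apache.storm.Config;

Config conf = new Config();
// topology.acker.executors: how many acker executors track tuple trees.
// By default Storm runs one acker executor per worker.
conf.setNumAckers(4);
```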
This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files, but I do not have a choice for this problem).
I set topology.max.spout.pending to 50k to throttle how many tuples enter the topology for processing. However, the setting seems to have no effect: I see all records in a file being emitted every time. My guess is that this is due to a loop in my nextTuple() method that emits all records in the file.
My question is: does Storm just stop calling nextTuple() for the spout task when topology.max.spout.pending is reached? Does this mean I should emit only one tuple each time the method is called?
Exactly! Storm can only throttle your spout by pausing its calls to nextTuple(), so if you emit everything on the first call, there is no way for Storm to throttle your spout.
The Storm developers recommend emitting a single tuple per nextTuple() call. The framework will then throttle your spout as needed to meet the max spout pending limit. If you need to emit a high number of tuples, batch your emits to at most a tenth of your max spout pending, to give Storm the chance to throttle.
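A sketch of the one-emit-per-call pattern for a file-reading spout (the file path is a placeholder and error handling is minimal):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class OneLinePerCallSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;
    private long msgId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            reader = new BufferedReader(new FileReader("/path/to/input.txt")); // placeholder
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                // One emit per nextTuple() call, anchored with a message id,
                // so max spout pending can actually throttle this spout
                collector.emit(new Values(line), msgId++);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```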
Storm topologies have a max spout pending parameter. The max spout pending value for a topology can be configured via the "topology.max.spout.pending" setting in the topology configuration yaml file. This value puts a limit on how many tuples can be in flight, i.e. have not yet been acked or failed, in a Storm topology at any point of time. The need for this parameter comes from the fact that Storm uses ZeroMQ to dispatch tuples from one task to another task. If the consumer side of ZeroMQ is unable to keep up with the tuple rate, then the ZeroMQ queue starts to build up. Eventually tuples timeout at the spout and get replayed to the topology, thus adding more pressure on the queues. To avoid this pathological failure case, Storm allows the user to put a limit on the number of tuples that are in flight in the topology. This limit takes effect on a per-spout-task basis and not on a topology level. (source) For cases when the spouts are unreliable, i.e. they don't emit a message id in their tuples, this value has no effect.

One of the problems that Storm users continually face is coming up with the right value for this max spout pending parameter. A very small value can easily starve the topology, and a sufficiently large value can overload the topology with a huge number of tuples to the extent of causing failures and replays. Users have to go through several iterations of topology deployments with different max spout pending values to find the value that works best for them.
One solution is to build the input queue outside the nextTuple method, so that the only thing nextTuple does is poll the queue and emit. If you are processing multiple files, your nextTuple method should also check whether the poll returned null and, if so, atomically reset the source file that is populating the queue.
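A sketch of that pattern, assuming a bounded queue filled by a background thread while nextTuple() only polls and emits; the file-handling methods are stubs standing in for your own logic:

```java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class QueueBackedSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private LinkedBlockingQueue<String> queue;
    private long msgId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        queue = new LinkedBlockingQueue<>(100_000); // bounded so the reader can't run away
        Thread reader = new Thread(this::fillQueueFromCurrentFile); // hypothetical loader
        reader.setDaemon(true);
        reader.start();
    }

    @Override
    public void nextTuple() {
        String record = queue.poll(); // non-blocking: nextTuple must return quickly
        if (record == null) {
            advanceToNextFile(); // hypothetical: atomically reset the source file
            return;
        }
        collector.emit(new Values(record), msgId++); // anchored with a message id
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }

    // Stubs standing in for the file-reading logic described above
    private void fillQueueFromCurrentFile() { /* read lines, queue.put(line) */ }
    private void advanceToNextFile() { /* switch to the next source file */ }
}
```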