Currently I have a simple topology. It has a spout to read data in, a bolt to transform the data and a bolt to store the data in a data store. I have set up anchoring in order to get notified of any tuples that fail. Due to the way I read data in, there is only one spout but multiple transformer and store bolts. What I have noticed is that when the number of tuples in the cluster gets large, the topology runs very slowly. When I removed the acking, everything went much faster. So I tried to increase the number of ackers, but I noticed that there was still only one acker thread tracking all the tuples. Is there always just one acker thread per spout to track the tuples?
The question:
How can I randomly fetch an old chunk of messages given a range definition of [partition, start offset, end offset]? Ideally I would fetch ranges from multiple partitions at once (one range per partition). This also needs to work in a concurrent environment.
My ideas for a solution so far
I guess I can use a pool of consumers for the concurrency and, for each fetch, use Consumer.seek and Consumer.poll with max.poll.records. But this seems wrong. There is no guarantee that I will get exactly the same chunk, for example when a message gets deleted (by log compaction). On the whole, this seek + poll method does not seem like the right fit for a one-time random fetch.
My use case:
Like the typical consumer, mine reads 10MB chunks of messages and processes them.
In order to process that chunk I am pushing 3-20 jobs to different topics, in some kind of workflow.
Now, my goal is to avoid pushing the same chunk into the other topics again and again. It seems better to push a reference to that chunk, e.g. [Topic X / partition Y, start offset, end offset]. Then, when a job is processed, it fetches the exact chunk again.
Your idea seems fine, and is practically the only solution with the Consumer API. There's nothing you can do once messages are removed between offsets.
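To make the seek + poll idea concrete, here is a minimal sketch that uses an in-memory map as a stand-in for one partition's log. It is not the Kafka client: ChunkRef, PartitionLogSketch and the half-open [start, end) convention are assumptions made purely for illustration. It reads from the start offset and stops at the end offset instead of relying on a record count, and it shows that a record removed in the meantime is simply missing from the re-fetched chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical reference to a chunk: partition plus offsets [startOffset, endOffset). */
final class ChunkRef {
    final int partition;
    final long startOffset; // inclusive (assumption)
    final long endOffset;   // exclusive (assumption)

    ChunkRef(int partition, long startOffset, long endOffset) {
        this.partition = partition;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

/** In-memory stand-in for one partition's log (offset -> message); not the Kafka client. */
final class PartitionLogSketch {
    private final NavigableMap<Long, String> records = new TreeMap<>();

    void append(long offset, String value) {
        records.put(offset, value);
    }

    /** Simulates compaction or retention removing a record. */
    void delete(long offset) {
        records.remove(offset);
    }

    /** "Seeks" to the start offset and reads until the end offset is reached. */
    List<String> fetch(ChunkRef ref) {
        return new ArrayList<>(
                records.subMap(ref.startOffset, true, ref.endOffset, false).values());
    }
}
```

With a real consumer the same shape would probably be: assign the partition, seek to the start offset, then poll and keep only records whose offset is below the end offset, rather than bounding the chunk with max.poll.records.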
If you really needed every single message between each and every possible offset range, then you should consider consuming that data as it's actively produced into some externally indexable destination where offset scans are also a common operation. Plenty of Kafka Connectors exist, as do lots of databases and filesystems that could serve as the destination. But the takeaway here is that I think you might have to reconsider your options for these "reprocessing" jobs.
I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic I created.
I found that iteration is possible in the consumer. Is there any way to do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
1) Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
2) Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
I think there are two improvements that would make this a lot easier:
1) Producer idempotence could be done automatically and much more cheaply
by optionally integrating support for this on the server.
2) The existing high-level consumer doesn't expose a lot of the more fine grained
control of offsets (e.g. to reset your position). We will be working
on that soon.
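As a rough illustration of the consumer-side deduplication the FAQ describes, here is a minimal sketch. SeenFilter and offsetId are hypothetical names, and the in-memory set is an assumption made for brevity; in practice the seen ids would have to live in the same store as the data, so that both survive a restart together.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical consumer-side filter that drops records whose id was already seen. */
final class SeenFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time an id is offered and false for every repeat. */
    boolean firstTime(String id) {
        return seen.add(id);
    }

    /** Builds an id from the topic/partition/offset combination. */
    static String offsetId(String topic, int partition, long offset) {
        return topic + "/" + partition + "/" + offset;
    }
}
```

The id can be either a primary key carried in the message (a UUID, say) or the topic/partition/offset combination; the filter only processes a record when firstTime returns true.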
Imagine a scenario where we have 3 partitions belonging to 3 different topics on a machine which runs a Kafka process/broker. This broker will receive messages for all three partitions. It will store them in different log subdirectories. My question is: how does the Kafka broker schedule these writes? How does it decide which partition/topic will be written next?
For ordering across requests, the broker internally handles produce requests roughly as follows:
There are a number of network threads that pull bytes off the network layer and convert them into internal requests. These requests are then placed in a FIFO request queue, from which the I/O threads pull them and append the contained messages to the relevant partitions. So in short, messages are processed in the order in which they are received.
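The flow described above can be sketched in a few lines. This is a toy model and not broker code: RequestQueueSketch and ProduceRequest are hypothetical names, and a single drain loop stands in for the pool of I/O threads so that the result is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of the request queue; not actual broker code. */
final class RequestQueueSketch {
    /** Hypothetical produce request carrying one message for one topic-partition. */
    static final class ProduceRequest {
        final String topicPartition;
        final String message;

        ProduceRequest(String topicPartition, String message) {
            this.topicPartition = topicPartition;
            this.message = message;
        }
    }

    private final Queue<ProduceRequest> requestQueue = new ArrayDeque<>();
    private final Map<String, List<String>> logs = new HashMap<>();

    /** What a network thread would do once it has decoded a request. */
    void enqueue(ProduceRequest request) {
        requestQueue.add(request);
    }

    /** What an I/O thread would do: take the oldest request and append its message. */
    List<String> drain() {
        List<String> writeOrder = new ArrayList<>();
        ProduceRequest request;
        while ((request = requestQueue.poll()) != null) {
            logs.computeIfAbsent(request.topicPartition, k -> new ArrayList<>())
                .add(request.message);
            writeOrder.add(request.topicPartition);
        }
        return writeOrder;
    }

    /** Messages appended so far to one topic-partition, in append order. */
    List<String> log(String topicPartition) {
        return logs.getOrDefault(topicPartition, new ArrayList<>());
    }
}
```

In this model the partition that gets written next is simply the one named by the oldest queued request; there is no per-partition scheduling.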
Looking through the code, I am unsure whether there may be potential for a race condition here, where a smaller request could "overtake" a large request that was sent immediately before it. However, even if this were possible, it is an extremely unlikely fringe case that I can't see ever occurring for a single producer. Maybe someone with a better understanding of the code can weigh in here?
As for the ordering of batched messages within one request: the request stores messages internally in a HashMap keyed by TopicPartition. As far as I am aware, a Scala HashMap does not preserve the insertion order of its elements, so I don't think there are any guarantees around the order in which multiple partitions in one request get processed. That is fine, since ordering is only guaranteed to be preserved within a partition.
Within each partition, messages are processed in the order they were given to the producer before sending.
I'm using Apache Storm to process a large amount of data coming off a Kafka spout. Currently, over 3k JSON messages have already been published to Kafka and more keep arriving. I have to process all the messages published from the beginning, so I have set the Kafka spout parameter accordingly.
This results in a lot of failures in tuple processing. I got this info from the storm UI.
I suspect Storm is not able to handle all the messages being thrown at it at once.
Any help is appreciated.
1) increase the parallelism hint for the bolts so that there's no backlog slowing down the processing for any tuple emitted by the spout, or
2) use the topology.max.spout.pending property to limit the number of tuples the spout can emit before having to wait for one of those tuples to complete.
Try a combination of both solutions. In production you usually need to run many iterations to find proper values for both settings (parallelism and topology.max.spout.pending).
This is a question regarding how Storm's max spout pending works. I currently have a spout that reads a file and emits a tuple for each line in the file (I know Storm is not the best solution for dealing with files but I do not have a choice for this problem).
I set the topology.max.spout.pending to 50k to throttle how many tuples get into the topology to be processed. However, I see this number not having any effect in the topology. I see all records in a file being emitted every time. My guess is this might be due to a loop I have in the nextTuple() method that emits all records in a file.
My question is: Does Storm just stop calling nextTuple() for the Spout task when topology.max.spout.pending is reached? Does this mean I should only emit one tuple every time the method is called?
Exactly! Storm can only throttle your spout by holding back calls to nextTuple(), so if you emit everything on the first call, there is no way for Storm to throttle your spout.
The Storm developers recommend emitting a single tuple per nextTuple() call. The Storm framework will then throttle your spout as needed to meet the "max spout pending" requirement. If you're emitting a high number of tuples, you can batch your emits to at most a tenth of your max spout pending, to give Storm the chance to throttle.
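Here is a small simulation of why the loop defeats the throttle. It does not use Storm's classes: MiniSpout and peakPending are hypothetical stand-ins, under the assumption that the framework only decides between calls whether to invoke nextTuple() again. No tuple is acked during the run, to mimic bolts that have fallen behind.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of max spout pending; not Storm's implementation. */
final class SpoutThrottleSketch {
    /** Hypothetical stand-in for a spout; not Storm's ISpout interface. */
    interface MiniSpout {
        /** Emits zero or more tuples and returns how many were emitted. */
        int nextTuple();
    }

    /**
     * Calls nextTuple() only while fewer than maxPending tuples are in flight
     * and returns the highest in-flight count that was observed.
     */
    static int peakPending(MiniSpout spout, int maxPending, int maxCalls) {
        int pending = 0;
        int peak = 0;
        for (int call = 0; call < maxCalls && pending < maxPending; call++) {
            pending += spout.nextTuple();
            peak = Math.max(peak, pending);
        }
        return peak;
    }

    /** Emits one line per call, as recommended. */
    static MiniSpout oneAtATime(Queue<String> lines) {
        return () -> lines.poll() == null ? 0 : 1;
    }

    /** Emits every remaining line in one call, like a loop inside nextTuple(). */
    static MiniSpout allAtOnce(Queue<String> lines) {
        return () -> {
            int emitted = 0;
            while (lines.poll() != null) {
                emitted++;
            }
            return emitted;
        };
    }

    /** Builds a queue of n dummy lines. */
    static Queue<String> lines(int n) {
        Queue<String> queue = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            queue.add("line-" + i);
        }
        return queue;
    }
}
```

With 1000 lines and a limit of 50, the one-at-a-time spout peaks at 50 tuples in flight, while the looping spout has already put all 1000 in flight before the limit can be checked.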
Storm topologies have a max spout pending parameter. The max
spout pending value for a topology can be configured via the
“topology.max.spout.pending” setting in the topology
configuration yaml file. This value puts a limit on how many
tuples can be in flight, i.e. have not yet been acked or failed, in a
Storm topology at any point of time. The need for this parameter
comes from the fact that Storm uses ZeroMQ to dispatch
tuples from one task to another task. If the consumer side of
ZeroMQ is unable to keep up with the tuple rate, then the
ZeroMQ queue starts to build up. Eventually tuples timeout at the
spout and get replayed to the topology thus adding more pressure
on the queues. To avoid this pathological failure case, Storm
allows the user to put a limit on the number of tuples that are in
flight in the topology. This limit takes effect on a per spout task
basis and not on a topology level. For cases when the spouts are
unreliable, i.e. they don’t emit a message id in their tuples, this
value has no effect.
One of the problems that Storm users continually face is in
coming up with the right value for this max spout pending
parameter. A very small value can easily starve the topology and
a sufficiently large value can overload the topology with a huge
number of tuples to the extent of causing failures and replays.
Users have to go through several iterations of topology
deployments with different max spout pending values to find the
value that works best for them.
One solution is to build the input queue outside the nextTuple method, so that the only thing nextTuple does is poll the queue and emit. If you are processing multiple files, your nextTuple method should also check whether the result of polling the queue is null and, if so, atomically reset the source file that is populating your queue.
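A minimal sketch of that shape, without Storm's classes: QueueBackedSpoutSketch and offerLines are hypothetical names, and the emitted list is an assumption standing in for the real collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Toy spout that only polls a queue in nextTuple(); not a Storm class. */
final class QueueBackedSpoutSketch {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final List<String> emitted = new ArrayList<>();

    /** Called from outside nextTuple(), e.g. by a reader thread, to buffer lines. */
    void offerLines(List<String> lines) {
        queue.addAll(lines);
    }

    /** Emits at most one tuple per call; returns false when nothing was buffered. */
    boolean nextTuple() {
        String line = queue.poll();
        if (line == null) {
            return false; // queue drained: the place to switch to the next source file
        }
        emitted.add(line); // stand-in for collector.emit(new Values(line), messageId)
        return true;
    }

    /** Lines emitted so far, in emit order. */
    List<String> emitted() {
        return emitted;
    }
}
```

Because each call emits at most one line, the framework gets a chance to stop calling nextTuple() as soon as the pending limit is reached, no matter how many lines the reader has already buffered.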