What exactly is StreamTask in StreamThread in kafka streams? - apache-kafka

I am trying to understand how Kafka Streams works under the hood (to know it a little better), and I came across a Confluent article that is really wonderful.
It uses two terms: StreamThreads and StreamTasks.
I am not able to understand what exactly a StreamTask is.
Is it executed by a StreamThread?
According to the docs, a StreamThread can have multiple StreamTasks, so won't there be data sharing, and won't the thread run slower? How does a StreamThread "run" multiple StreamTasks?
Any explanation in simple words would be of great help.

"Tasks" are a logical abstractions of work than can be done in parallel (ie, stuff that can be processed independent from each other). Kafka Streams basically creates a task for each input topic partition, because data in different partitions can processed independent from each other (it's a simplification, but holds if you have a single input topic; for joins it's a little bit different).
A StreamThread is basically a JVM thread. Tasks are assigned to StreamThreads for execution. In the current implementation, a StreamThread basically loops over all its tasks and processes some amount of input data for each task. In between, the StreamThread (which uses a KafkaConsumer) polls the broker for new data for all its assigned tasks.
Because tasks are independent of each other, you can run as many threads as there are tasks. In that case, each thread would execute only a single task.
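The loop described above can be sketched with plain Java, without any Kafka dependency. This is only an illustration of the idea (one thread giving each assigned task a bounded slice of work per pass); the class name, the `simulate` method, and the per-pass record limit are all hypothetical and are not the actual StreamThread implementation.

```java
import java.util.Arrays;

public class StreamThreadSketch {

    /**
     * Simulates one thread looping over its assigned tasks.
     * buffered[i] is the number of records waiting for task i (hypothetical input).
     * Returns how many records each task has processed after the given number of passes.
     */
    static int[] simulate(int[] buffered, int maxPerTask, int passes) {
        int[] remaining = buffered.clone();
        int[] processed = new int[buffered.length];
        for (int pass = 0; pass < passes; pass++) {
            // A real thread would poll the broker for new data here.
            for (int task = 0; task < remaining.length; task++) {
                // Process only a bounded amount per task, then move on to the next task.
                int n = Math.min(maxPerTask, remaining[task]);
                remaining[task] -= n;
                processed[task] += n;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // Three tasks (think 0_0, 0_1, 0_2) with 5, 1 and 3 buffered records.
        System.out.println(Arrays.toString(simulate(new int[] {5, 1, 3}, 2, 2)));
        // prints [4, 1, 3]
    }
}
```

Because each task only touches its own counters, no state is shared between tasks, which is why the same tasks could equally be spread over several threads.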

Related

Difference between executing StreamTasks in the same instance v/s multiple instances

Say I have a topic with 3 partitions
Method 1: I run one instance of Kafka Streams. It starts 3 tasks [0_0, 0_1, 0_2], and each of these tasks consumes from one partition.
Method 2: I spin up three instances of the same streams application. Again three tasks are started, but now they are distributed among the 3 instances that were created.
Which method is preferable and why?
In method 1 do all the tasks run as a part of the same thread, and in method 2, they run on different threads, or is it different?
Consider that the streams application has a very simple topology, and does only mapping of values from a single stream
By default, a single KafkaStreams instance runs one thread, thus in "Method 1" all three tasks are executed by a single thread. In "Method 2" each task is executed by its own thread. Note that you can also configure multiple threads per KafkaStreams instance via the num.stream.threads configuration parameter. If you set it to 3 for "Method 1", both methods are more or less the same. How many threads you need depends on your workload, i.e., how many messages you need to process per time unit and how expensive the computation is. It also depends on the hardware: for a single-core CPU, it may not make sense to configure more than one thread; instead you should deploy multiple instances on multiple machines to get more hardware. Hence, if your workload is lightweight, one single-threaded instance might be enough.
Also note that you may be network bound. In that case, starting more threads would not help, and you would want to scale out to multiple machines, too.
The last consideration is fault-tolerance. Even if a single thread/instance may be powerful enough to not lag, what should happen if the instance crashes? If you only have one instance, the whole computation goes down. If you run two instances, the second instance would take over all the work and your application stays online.
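As a rough illustration of the num.stream.threads setting mentioned above, the configuration could be built like this. The sketch uses plain `java.util.Properties` with the literal config keys so that it runs without the Kafka Streams library; the application id and broker address are placeholder values, and in a real application the resulting properties would be passed to the KafkaStreams constructor.

```java
import java.util.Properties;

public class StreamsThreadsConfig {

    /** Builds a minimal Kafka Streams configuration with the given thread count. */
    static Properties build(String applicationId, String bootstrapServers, int threads) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);       // placeholder value supplied by caller
        props.setProperty("bootstrap.servers", bootstrapServers); // placeholder value supplied by caller
        // One thread per task for a 3-partition topic makes "Method 1"
        // behave much like "Method 2", just inside a single JVM.
        props.setProperty("num.stream.threads", Integer.toString(threads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("my-streams-app", "localhost:9092", 3);
        System.out.println(props.getProperty("num.stream.threads"));
        // prints 3
    }
}
```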

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more-or-less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
        .shuffle()
        .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
        .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));
    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
        .withProducerProperties(kafkaProperties)
        .withKafkaTopicSelector(new ODTopicSelector())
        .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks and/or by avoiding starvation while a batch is finishing?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batches 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block the emission of new ones: even if batches 2 and 3 are fully processed, they have to wait for batch 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.
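To make the batch-ordering point above more concrete, here is a small simulation in plain Java. It is a deliberately simplified model of the behavior described in this answer, not Trident code: it assumes every in-flight batch is processed in parallel, that batches are written strictly in order, and that a new batch is emitted only when the oldest in-flight batch has been written. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class OrderedCommitSketch {

    /**
     * Returns the time at which each batch is emitted.
     * durations[i] is the processing time of batch i; maxPending must be at least 1.
     */
    static long[] emitTimes(long[] durations, int maxPending) {
        int n = durations.length;
        long[] emit = new long[n];
        long[] written = new long[n];
        for (int i = 0; i < n; i++) {
            // Batch i can only be emitted once batch (i - maxPending) has been written.
            emit[i] = (i < maxPending) ? 0 : written[i - maxPending];
            long finished = emit[i] + durations[i];
            // Writes are ordered: batch i cannot be written before batch i - 1.
            written[i] = (i == 0) ? finished : Math.max(finished, written[i - 1]);
        }
        return emit;
    }

    public static void main(String[] args) {
        // One slow batch (10 time units) followed by three fast ones (1 unit each).
        long[] durations = {10, 1, 1, 1};
        System.out.println(Arrays.toString(emitTimes(durations, 3)));
        // prints [0, 0, 0, 10]
    }
}
```

In this model the fourth batch is not emitted until time 10, although the second and third batches finished processing at time 1, which matches the starvation pattern described in the question.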

Can I attach multiple transformers/processors to a single stream in Apache Kafka

In all the examples I see a simple topology with a single transformer/processor for Kafka. My question is whether we can modularise application logic by breaking it down into multiple transformers/processors applied sequentially to a single input stream.
Please find the use case below:
The current application configuration is a single processor containing all processing logic, such as filtering, validation, application logic, delaying (Kafka is too fast for the DBs) and invoking a stored procedure or pushing downstream.
But we are now planning to decouple all these operations by breaking each task down into a separate processor/transformer of the KStream.
Since we are relatively new to Kafka, we are not sure of the pros and cons of this approach, especially with respect to Kafka internals like state stores, task scheduling and the multithreading model.
Please share your expert opinions and experiences
Please note that we do not have control over topic, no new topic can be created for this design. The design must be feasible for the existing topic alone.
Kafka Streams allows you to split your logic into multiple processors. Internally, Kafka Streams implements a "depth-first" execution strategy. Thus, each time you call "forward", the output tuple is immediately processed by the downstream processor, and "forward" returns after downstream processing has finished (note that writing data into a topic and reading it back "breaks" the in-memory pipeline; thus, when data is written to a topic, there is no guarantee when the downstream processor will read and process those records).
If you have state that is shared between multiple processors, you need to attach the store to all processors that need access to it. Execution on the store will be single-threaded, and thus there should be no performance difference.
As long as you connect processors directly (and not via topics), all processors will be part of the same task. Thus, there shouldn't be a performance difference.
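The "depth-first" behavior of forward can be illustrated with a tiny stand-alone sketch. The `Node` class below is a hypothetical stand-in and not the Kafka Streams Processor API; it only shows that a synchronous forward call returns after all downstream nodes have handled the record. The processor names are made up to mirror the use case in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthFirstSketch {

    /** Hypothetical mini-processor that records when it starts and ends handling a record. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        final List<String> log;

        Node(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void process(String record) {
            log.add(name + " start " + record);
            forward(record); // returns only after all downstream nodes are done
            log.add(name + " end " + record);
        }

        void forward(String record) {
            for (Node child : children) {
                child.process(record);
            }
        }
    }

    /** Wires filter -> validate -> sink and pushes the records through one by one. */
    static List<String> run(String... records) {
        List<String> log = new ArrayList<>();
        Node filter = new Node("filter", log);
        Node validate = new Node("validate", log);
        Node sink = new Node("sink", log);
        filter.children.add(validate);
        validate.children.add(sink);
        for (String record : records) {
            filter.process(record);
        }
        return log;
    }

    public static void main(String[] args) {
        run("a").forEach(System.out::println);
    }
}
```

For a single record, the log shows the nodes starting in the order filter, validate, sink and ending in the reverse order, so one record travels through the whole chain before the next record is picked up.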

How can (messaging) queue be scalable?

I frequently see queues in software architecture, especially in systems called "scalable", a prominent example being actors in the Akka.io multi-actor platform. However, how can a queue be scalable if we have to synchronize placing messages in the queue (and therefore operate in a single thread rather than multiple threads) and again synchronize taking messages out of the queue (to ensure that each message is taken exactly once)? It gets even more complicated when those messages can change the state of the (actor) system: in this case, even after taking a message out of the queue, the work cannot be load balanced, but must still be processed in a single thread.
1. Is it correct that putting messages into a queue must be synchronized?
2. Is it correct that taking messages out of a queue must be synchronized?
3. If 1 or 2 is correct, then how is a queue scalable? Doesn't synchronizing on a single thread immediately create a bottleneck?
4. How can an (actor) system be scalable if it is stateful?
5. Does a stateful actor/bean mean that I have to process messages in a single thread and in order?
6. Does statefulness mean that I have to have a single copy of the bean/actor in the entire system?
7. If 6 is false, then how do I share this state between instances?
8. When I am trying to connect my new P2P node to the network, I believe I have to have some "server" that will tell me who the other peers are, is that correct? When I am trying to download a torrent, I have to connect to a tracker. If there is a "server", then why do we call it P2P? If this tracker goes down, then I cannot connect to peers, is that correct?
9. Is synchronization and statefulness destroying scalability?
Is it correct, that putting messages in queue must be synchronized?
Is it correct, that putting messages out of queue must be synchronized?
No.
Assuming we're talking about the Java synchronized keyword, that is a reentrant mutual exclusion lock on the object. Even with multiple threads taking that lock, access can be fast as long as contention is low. Each object has its own lock, so there are many locks, each of which only needs to be held for a short time, i.e. it is fine-grained locking.
But even if synchronization were required, queues need not be implemented via mutual exclusion locks. Lock-free and even wait-free queue data structures exist. In short, the mere presence of locks does not automatically imply single-threaded execution, and queues do not even need locks.
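As a small illustration of a queue that does not rely on mutual exclusion locks, the JDK's `java.util.concurrent.ConcurrentLinkedQueue` uses a non-blocking algorithm. The sketch below has several threads enqueue and dequeue concurrently without any synchronized block; the class name, method name and thread counts are arbitrary choices for the example.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class LockFreeQueueSketch {

    /** Several threads enqueue concurrently, then several threads drain; returns items taken. */
    static int produceAndConsume(int threads, int perProducer) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        AtomicInteger taken = new AtomicInteger();

        // Producers: no synchronized block is needed around offer().
        runAll(threads, () -> {
            for (int i = 0; i < perProducer; i++) {
                queue.offer(i);
            }
        });

        // Consumers: poll() hands each element to exactly one caller.
        runAll(threads, () -> {
            while (queue.poll() != null) {
                taken.incrementAndGet();
            }
        });
        return taken.get();
    }

    /** Starts the given number of threads running the same body and waits for all of them. */
    private static void runAll(int count, Runnable body) {
        Thread[] workers = new Thread[count];
        for (int i = 0; i < count; i++) {
            workers[i] = new Thread(body);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(produceAndConsume(4, 1000));
        // prints 4000
    }
}
```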
The rest of your questions should be asked separately because they are not about message queuing.
Of course you are correct in that a single queue is not scalable. The point of the Actor Model is that you can have millions of Actors and therefore distribute the load over millions of queues—if you have so many cores in your cluster. Always remember what Carl Hewitt said:
One Actor is no actor. Actors come in systems.
Each single actor is a fully sequential and single-threaded unit of computation. The whole model is constructed such that it is perfectly suited to describe distribution, though; this means that you create as many actors as you need.

Scala task parallelization with actors => How does the scheduler work?

I have a task which can easily be broken into parts which can and should be processed in parallel to optimize performance.
I wrote a producer actor which prepares each part of the task that could be processed independently. This preparation is relatively cheap.
I wrote a consumer Actor that processes each of the independent tasks. Depending on the parameters each piece of independent task may take up to a couple of seconds to be processed. All tasks are quite the same. They all process the same algorithm, with the same amount of data (but different values of course) resulting in about equal time of processing.
So the producer is much faster than the consumer. Hence there may quickly be 200 or 2000 tasks prepared (depending on the parameters), all of them consuming memory while only a couple of them can be executed at once.
Now I see two simple strategies to consume and process the tasks:
Create a new consumer actor instance for each task.
Each consumer processes only one task.
I assume there would be many consumer actor instances at the same time, while only a couple of them, can be processed at any point in time.
How does the default scheduler work? Can each consumer actor finish processing before the next consumer will be scheduled? Or will a consumer be interrupted and be replaced by another consumer resulting in longer time until the first task will be finished? I think this actor scheduling is not the same as process or thread scheduling, but I can imagine, that interruption can still have some disadvantages (e.g. like more cache misses).
The other strategy is to use N instances of the consumer actor and send the tasks to process as messages to them.
Each consumer processes multiple tasks in sequence.
It is left up to me to find an appropriate value for N (the number of consumers).
The distribution of the tasks over the N consumers is also left up to me.
I could imagine a more sophisticated solution where more coordination is done between the producer and the consumers, but I can't make a good decision without knowledge about the scheduler.
If a manual solution will not result in significantly better performance, I would prefer a default solution (delivered by some part of the Scala world), where scheduling tasks is not left up to me (like strategy 1).
Question roundup:
How does the default scheduler work?
Can each consumer actor finish processing before the next consumer will be scheduled?
Or will a consumer be interrupted and be replaced by another consumer resulting in longer time until the first task will be finished?
What are the disadvantages when the scheduler frequently interrupts an actor and schedules another one? Cache-Misses?
Would this interruption and scheduling be like a context-change in process scheduling or thread scheduling?
Are there any more advantages or disadvantages comparing these strategies?
Especially does strategy 1 have disadvantages over strategy 2?
Which of these strategies is the best?
Is there a better strategy than I proposed?
I'm afraid, that questions like the last two can not be answered absolutely, but maybe this is possible this time as I tried to give a case as concrete as possible.
I think the other questions can be answered without much discussion. With those answers it should be possible to choose the strategy fitting the requirements best.
I made some research and thoughts myself and came up with some assumptions. If any of these assumptions are wrong, please tell me.
If I were you, I would go ahead with the 2nd option. A new actor instance for each task would be too tedious. Also, with a smart choice of N, the complete system resources can be used.
This is not a complete solution, but one possible option is for the producer to stop or slow down the rate at which it produces tasks. This would be ideal: the producer produces more tasks only when a consumer is available.
Assuming you are using Akka (if you don't, you should ;-) ), you could use a SmallestMailboxRouter to start a number of actors (you can also add a Resizer) and the message distribution will be handled according to some rules. You can read everything about routers here.
For such a simple task, actors give no benefit at all. Implement the producer as a Thread, and each task as a Runnable. Use a thread pool from java.util.concurrent to run the tasks. Use a java.util.concurrent.Semaphore to limit the number of prepared and running tasks: before creating the next task, the producer acquires the semaphore, and each task releases the semaphore at the end of its execution.
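A minimal sketch of this semaphore-bounded approach might look as follows. For brevity the producer loop runs on the calling thread rather than in its own Thread, and the task body is a placeholder; the class name, the `run` method and its parameters are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedProducerSketch {

    /** Runs totalTasks tasks and returns {completed tasks, peak number of prepared-or-running tasks}. */
    static int[] run(int totalTasks, int poolSize, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < totalTasks; i++) {
                // The producer blocks here once maxInFlight tasks are prepared or running.
                permits.acquire();
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                pool.execute(() -> {
                    try {
                        completed.incrementAndGet(); // placeholder for the real work
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // lets the producer prepare the next task
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return new int[] {completed.get(), peak.get()};
    }

    public static void main(String[] args) {
        int[] result = run(2000, 4, 8);
        System.out.println("completed=" + result[0] + ", peak in flight=" + result[1]);
    }
}
```

With this arrangement the number of prepared tasks held in memory never exceeds the number of permits, regardless of how much faster the producer is than the consumers.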