Storm+Kafka not parallelizing as expected

We are having an issue with the parallelism of tasks inside a single topology. We cannot manage to get a good, steady processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in terms of message count, about 2000 per day, but each task can take quite some time. One topology in particular takes a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to handle all incoming messages. All topologies and the Kafka system are installed on a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm's concurrency capabilities to improve throughput.
To that end, the topology has been configured as follows:
4 workers
parallelism hint set to 10
Fetch size when reading from Kafka large enough to read about 8 tasks per batch.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), with a max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers: some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and a parallelism hint of 10, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of having too little work to do: we tried inserting 2000 tasks at the beginning, so there is plenty of work available.
We have tried increasing the parameter maxSpoutPending, expecting the topology to read more batches and queue them at the same time, but it seems they are pipelined internally rather than processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    //This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));

    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);

    TridentTopology topology = new TridentTopology();

    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));

    //Create a state Factory to produce outputs to kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());

    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));

    return topology.build();
}
and the config is created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve performance, either by increasing the number of parallel tasks and/or by avoiding the starvation while finishing the processing of a batch?

I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream so that the UI shows you what bolts correspond to what sections. Trident packs operations into as few bolts as possible. In addition, it never repartitions your stream unless you've done an operation that explicitly involves a repartitioning (e.g. shuffle, groupBy, partitionBy, global aggregation, etc). This property of Trident ensures that you can control the ordering/semi-ordering of how things are processed. So in this case, everything before the groupBy has to have the same parallelism or else Trident would have to repartition the stream. And since you didn't say you wanted the stream repartitioned, it can't do that. You can get a different parallelism for the spout vs. the each's following by introducing a repartitioning operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
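Applied to the topology above, that could look roughly like this (the spout-side hint of 10 is an assumption, chosen to match the 10 Kafka partitions):

Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
        .parallelismHint(10)   // spout-side parallelism (assumed 10, matching the partition count)
        .shuffle()             // repartition so the each() below can run at a different parallelism
        .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
        .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));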
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batches 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it will emit one more. So slow batches can block the emission of new ones: even if batches 2 and 3 are fully processed, they have to wait for batch 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.
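For what it's worth, a rough sketch of what a plain-Storm variant using storm-kafka-client could look like; ProcessTaskBolt is a hypothetical bolt wrapping the logic of ProcessTask, and the broker address, topic, group id and numbers are assumptions, not values from the original setup:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class PlainStormSketch {
    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
                .builder("localhost:9092", "correlated-topic") // assumed broker and topic
                .setProp("group.id", "od-processing")          // assumed consumer group
                .build();

        TopologyBuilder builder = new TopologyBuilder();
        // One spout executor per partition is a common starting point.
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 10);
        // ProcessTaskBolt is hypothetical: a plain bolt doing the work of ProcessTask.
        builder.setBolt("process", new ProcessTaskBolt(), 10)
               .shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        // In plain Storm this caps in-flight tuples per spout task, not whole batches.
        conf.setMaxSpoutPending(50);

        StormSubmitter.submitTopology("od-plain-topology", conf, builder.createTopology());
    }
}

Unlike Trident, plain Storm has no batch-level ordering, so a slow tuple only occupies one executor instead of holding back the next batch.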

Related

Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum while keeping recovery quick (avoiding reprocessing of messages, because it is expensive).
I realized that if there were some kind of failure, such as my microservice crashing, my messages would be reprocessed. So I thought of adding some kind of 'checkpoint' to my process by writing the state of the transformed message to a file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, right after writing to the file succeeds.
But then, upon further thought, I realized that if there were a failure on the file system, I might not find my files; e.g. a cloud file service might still have a chance of failure even if the marketed rate is >99% availability. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.
For example, in your consumer-worker reading loop:
(pseudocode)
while(...)
{
    MessageAndOffset = getMsg();
    //do your things
    saveOffsetInQueueToDB(offset);
}
saveOffsetInQueueToDB is responsible for adding the offset to a queue/list, or whatever. This operation is only done once the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when a shutdown is detected, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA backed storing system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
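A minimal sketch of that buffering idea (every class and method name here is hypothetical, and the flush threshold is an arbitrary choice):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory buffer of processed offsets, flushed to external storage in bulk.
public class OffsetBuffer {
    private static final int FLUSH_THRESHOLD = 100; // arbitrary

    // "topic-partition" -> highest processed offset
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();
    private int sinceLastFlush = 0;

    public synchronized void record(String topic, int partition, long offset) {
        offsets.merge(topic + "-" + partition, offset, Math::max);
        if (++sinceLastFlush >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    public synchronized void flush() {
        // persistOffsets(...) stands in for the batch write to a DB, S3/Azure Blobs, or a logger.
        persistOffsets(new HashMap<>(offsets));
        sinceLastFlush = 0;
    }

    private void persistOffsets(Map<String, Long> snapshot) {
        // e.g. a batch UPSERT into a table keyed by topic-partition (left out here).
    }
}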
If there's a crash, you could read all messages from the beginning (the easiest way is just changing the group.id and setting auto.offset.reset to earliest) but discard those whose offset is already in the database, avoiding the reprocessing. For example, by adding a condition to your loop (yep, pseudocode again):
while(...)
{
    MessageAndOffset = getMsg();
    if (offset.notIncluded(offsetListFromDB))
    {
        //do your things
        saveOffsetInQueueToDB(offset);
    }
}
You could implement a more performant algorithm than a "not included" check, for example by storing only the last read offset for each partition in a HashMap and then checking whether each incoming offset for a partition is greater than the stored one. For example, if partition 0's last offset was 558 and partition 1's was 600:
//offsetMap = {[0,558],[1,600]}
while(...)
{
    MessageAndOffset = getMsg();
    //get partition => 0
    if (offset > offsetMap.get(partition))
    {
        //do your things
        saveOffsetInQueueToDB(offset);
    }
}
This way, you guarantee that only the non-processed messages from each partition will be processed.
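Translated into actual consumer code, the per-partition check could look roughly like this; the topic name, group id, broker address and the loadLastOffsetsFromDB helper are assumptions for the sketch:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayWithOffsetFilter {
    public static void main(String[] args) {
        // Hypothetical helper returning e.g. {0=558, 1=600} from the external database.
        Map<Integer, Long> lastProcessed = loadLastOffsetsFromDB();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "replay-after-crash");        // fresh group id to re-read from the start
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    long last = lastProcessed.getOrDefault(record.partition(), -1L);
                    if (record.offset() > last) {
                        // do your things
                        lastProcessed.put(record.partition(), record.offset());
                        // saveOffsetInQueueToDB(record.offset()) would go here, as above.
                    }
                }
            }
        }
    }

    private static Map<Integer, Long> loadLastOffsetsFromDB() {
        return new HashMap<>(); // placeholder for the real lookup
    }
}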
Regarding file system failures, that's why Kafka comes as a cluster: fault tolerance in Kafka is achieved by copying partition data to other brokers, known as replicas.
So if you have 5 brokers, for example (and a replication factor of 5), you would have to experience 5 different system failures at the same time (assuming the brokers are on separate hosts) in order to lose any data. Even 4 different brokers could fail at the same time without losing any data.
All brokers then save the same data, the same partitions. If a filesystem error occurs on one of the brokers, the others will still hold all the information.
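For illustration, a replicated topic can be created programmatically; the topic name, partition count and broker address below are assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 10 partitions, replication factor 3: each partition survives 2 broker failures.
            NewTopic topic = new NewTopic("events", 10, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}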

High memory consumption in MessageChannelPartitionHandler in case of many partitions

Our use case: using remote partitioning, the job is divided into multiple partitions, and workers process these partitions via ActiveMQ.
The job is failing with a memory issue in the MessageChannelPartitionHandler handle method, where it holds a large number of StepExecutions in memory (we have around 20K StepExecutions/partitions in this case).
We override the message channel partition handler to submit controlled messages to ActiveMQ; even when we try to poll replies from the database we run into database connection timeout issues, and after increasing the idle connections this approach also fails because it has to hold all those StepExecutions in memory.
In either case, our custom handler or MessageChannelPartitionHandler, we face similar issues, and these step executions are required for aggregation at the master. Is there any alternative way of achieving this?
Can someone help us understand a better way of handling these long-running / huge data processing scenarios?

Flink session window scaling issues on YARN, Kafka

Job use case:
Create groups of events (transactions) that relate to each other, based on event members and event-start-time, from 3 streams, not relying on processing time.
We have an input throughput of around 20K events/sec.
All events of a transaction are sent to multiple Kafka topics (the job's sources) at the end of a transaction, and may arrive up to minutes late.
We want to create event groups that have start-time gaps of less than a few seconds and are identified by a business key (e.g. country and transaction type).
Transactions can be seconds to hours long, but all events forming a group of transactions should arrive within ~5 minutes after the end of a transaction.
Implementation:
The job consumes 3 data sources, assigns timestamps & watermarks, maps events to a common interface, unions the streams, keys the unioned stream, creates a session window by transaction end time with an allowed lateness, filters out groups with huge sizes (business reason), creates sub-groups (based on start time) within the groups from the session window, and finally sends the results to another topic.
Issue:
When running the job on a YARN cluster with Kafka connectors as I/O, the job is slow and does not scale well (~3K events/sec at parallelism 1, up to only ~5K events/sec at parallelism 10).
What I tried:
Playing around with scaling TMs, slots and memory; at most we had around 10 TMs with 2 slots used, each with 32 GB of memory (while keeping the Kafka partition parallelism the same, both on input and output)
Playing around with taskmanager.network.memory.fraction, containerized.heap-cutoff-ratio, taskmanager.memory.fraction
Playing around with kafka.producer.batch.size, kafka.producer.max.request.size
Setting a filesystem and then a RocksDB state backend, and setting taskmanager.memory.off-heap to true and taskmanager.memory.preallocate to true
Turning off checkpointing
Upgrading Flink to 1.9.2, the kafka-client to the latest version in use, and the Kafka servers (Confluent and Cloudera) ...
Code:
I recreated a simplified use case; the code is here: https://github.com/vladimirtomecko/flink-playground and the job looks basically like this:
implicit val (params, config) = PersonTransactionCorrelationConfig.parse(args)

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val aTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val bTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val cTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

aTransactions
  .union(bTransactions)
  .union(cTransactions)
  .keyBy(new PersonTransactionKeySelector)
  .window(EventTimeSessionWindows.withGap(Time.seconds(config.gapEndSeconds)))
  .allowedLateness(Time.seconds(config.latenessSeconds))
  .aggregate(new PersonTransactionAggregateFunction)
  .filter(new PersonTransactionFilterFunction(config.groupMaxSize))
  .flatMap(new PersonTransactionFlatMapFunction(config.gapStartSeconds * 1000))
  .addSink(new DiscardingSink[PersonTransactionsGroup])

val backend = new RocksDBStateBackend(config.checkpointsDirectory, true)
backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED)
env.setStateBackend(backend: StateBackend)

env.execute(getClass.getSimpleName)
Question:
Is my implementation incorrect for this use case?
Is there something I have missed?
Is there some other optimization I can try?
I have trouble finding the bottleneck in this scenario; any tips there?
Thank you
P.S. First time poster, be kind please.
Edit 1:
Input kafka topics are partitioned on the producer side with the same function used in the keyBy.
The partition count is equal to, or exactly 2 times, the parallelism of the flow.
Partitions per topic hold a similar number of events (deviation ~5-10%).
Topics are populated with different amounts of events (A has 10X more events than B, and B has 1000X more events than C), but AFAIK this shouldn't be an issue.

What exactly is StreamTask in StreamThread in kafka streams?

I am trying to understand how Kafka Streams works under the hood (to get to know it a little better), and came across a Confluent link, which is really wonderful.
It mentions two terms: StreamThreads and StreamTasks.
I am not able to understand what exactly a StreamTask is.
Is it executed by a StreamThread?
As per the docs, a StreamThread can have multiple StreamTasks, so won't there be any data sharing, and won't this thread run slower? How does a StreamThread "run" multiple StreamTasks?
Any explanation in simple words would be of great help.
"Tasks" are a logical abstractions of work than can be done in parallel (ie, stuff that can be processed independent from each other). Kafka Streams basically creates a task for each input topic partition, because data in different partitions can processed independent from each other (it's a simplification, but holds if you have a single input topic; for joins it's a little bit different).
A StreamThread is basically a JVM thread. Task are assigned to StreamsThread for execution. In the current implementation, a StreamThread basically loops over all tasks and processes some amount of input data for each task. In between, the StreamThread (that is using a KafkaConsumer) polls the broker for new data for all its assigned tasks.
Because tasks are independent from each other, you can run as many thread as there are tasks. For this case, each thread would execute only a single task.
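For example, with an input topic of six partitions (so six tasks), a minimal sketch of running one thread per task would be the following; the topic names, application id and broker address are assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class OneThreadPerTask {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "task-demo");           // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // With 6 input partitions there are 6 tasks; 6 threads means one task per thread.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 6);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}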

Scaling Kafka: how is new event processing capacity added dynamically?

To a large extent, getting throughput in a system built on Kafka rests on these degrees of freedom:
(highly recommended) messages should be share-nothing. If they are share-nothing, they can be randomly assigned to different partitions within a topic and processed independently of other messages
(highly recommended) the partition count per topic should be sized appropriately: more partitions per topic means greater possible levels of parallelism
(highly recommended) to avoid hotspots within a topic partition, the Kafka key may need to include time or some other varying data point so that a single partition does not unintentionally get the majority of the work (see the producer sketch below)
(helpful) the processing time per message should be small when possible
https://dzone.com/articles/20-best-practices-for-working-with-apache-kafka-at mentions other items fine tuning these principles
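As a sketch of the third point above, a producer key could combine the business key with a varying component such as a time bucket; all names and values here are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HotspotAvoidingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String clientId = "client-42";                       // illustrative business key
            long minuteBucket = System.currentTimeMillis() / 60_000;
            // Appending a time bucket spreads a hot client across partitions,
            // at the cost of per-client ordering within the topic.
            String key = clientId + "-" + minuteBucket;
            producer.send(new ProducerRecord<>("events", key, "payload")); // assumed topic
        }
    }
}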
Now suppose that an otherwise OK system gets a lot of new work. For example, a new and large client may be added mid-day, or an existing client may need to onboard a new account, adding zillions of new events. How do we scale horizontally, adding new capacity for this work?
If the messages are truly share-nothing throughout the entire system --- I have a data pipeline of services where A gets a message, processes it, publishes a new message to another service B, and so on --- adding new capacity to the system could be as easy as sending a message on a separate administration topic telling the consumer task(s) to spin up new threads. Then, as long as the number of partitions in the topic(s) is not a bottleneck, one would indeed have added new processing capacity.
This approach is certainly doable but is still suboptimal in these respects:
Work on different clientIds is definitely share-nothing. Merely adding new threads gets through the work faster, but any new work would interleave behind and within the existing client work. Had a new topic been available with a new pub/sub process pair(s), the new work could be done in parallel, provided the cluster has spare capacity for the new topic(s)
In general, share-nothing work may not be possible at every step in a data pipeline. If ordering was ever required, the addition of new subscriber threads could get messages out of order for a given topic partition. This happens when there are M partitions in a topic but >M subscriber threads. I have one such order-sensitive case. It's worth noting that ordering effectively means at most 1 subscriber thread per partition, so sizing partitions may be even more important
Tasks may not be allowed to add topics at runtime by the sysadmin
Even if adding topics at runtime is possible, system orchestration is required to tell the various producers that a clientID is no longer associated with the old topic T but rather with T'. WIP on T should be flushed first before using T'
How does the Cassandra community deal with adding capacity at runtime, or is this day-dreaming? Dynamic, elastic horizontal capacity seems to broadly center on these principles:
have spare capacity on your cluster
have extra unused topics for greater parallelism; create them at runtime, or pre-create them but leave them unused if sysadmins don't allow dynamic creation
equip the system so that events for a given clientID can be intercepted before they enter the pipeline and deferred to a special queue, know when the existing events for that clientID have flushed through the system, then update the config(s), sending the held/deferred events and any new events for new clients to the new topic
tell consumers to spin up more listeners
dynamically add more partitions? (doubt that's possible or practical)
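On that last point: Kafka's AdminClient does expose a call to increase a topic's partition count at runtime, so it is possible; whether it is practical here (keyed data would start mapping to different partitions) is the open question above. A minimal sketch, with the topic name and counts assumed:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the (hypothetical) "events" topic to 20 partitions in total.
            admin.createPartitions(
                    Collections.singletonMap("events", NewPartitions.increaseTo(20)))
                 .all()
                 .get();
        }
    }
}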