Flink session window scaling issues on YARN, Kafka

Job use case:
Create groups of events (transactions) that relate to each other, based on event members and event-start-time, from 3 streams, not relying on processing time.
We have an input throughput of around 20K events/sec.
All events of a transaction are sent to multiple kafka topics (job sources) on the end of a transaction, with the possibility of being late up to minutes.
We want to create event groups that have start times gaps less than a few seconds and are identified by a business key (e.g. country and transaction type).
Transactions can be seconds to hours long, but all events forming a group of transactions should arrive within ~5 minutes after the end of a transaction.
Implementation:
The job consumes 3 data sources, assigns timestamps & watermarks, maps events to a common interface, unions the streams, keys the unioned stream, creates a session window by transaction end time with an allowed lateness, filters out groups with huge numbers of events (business reason), creates sub-groups (based on start time) within the groups from the session window, and finally sends the results to another topic.
Issue:
When running the job on a YARN cluster with Kafka connectors as I/O, the job is slow and does not scale well (~3K events/sec at parallelism 1, up to 5K events/sec at parallelism 10).
What I tried:
Playing around with scaling TMs, slots and memory; at most we had around 10 TMs with 2 slots used, each with 32GB memory (while maintaining the same parallelism on Kafka partitions for both input and output).
Playing around with taskmanager.network.memory.fraction, containerized.heap-cutoff-ratio, taskmanager.memory.fraction
Playing around with kafka.producer.batch.size, kafka.producer.max.request.size
Setting a fs and then a rocksdb backend and setting taskmanager.memory.off-heap to true, taskmanager.memory.preallocate to true
Turning off checkpointing
Upgrading Flink to 1.9.2, kafka-client to latest used, kafka servers, confluent and cloudera ...
Code:
I recreated a simplified use-case, the code is here: https://github.com/vladimirtomecko/flink-playground and the job looks basically like this:
implicit val (params, config) = PersonTransactionCorrelationConfig.parse(args)
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val aTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val bTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val cTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

aTransactions
  .union(bTransactions)
  .union(cTransactions)
  .keyBy(new PersonTransactionKeySelector)
  .window(EventTimeSessionWindows.withGap(Time.seconds(config.gapEndSeconds)))
  .allowedLateness(Time.seconds(config.latenessSeconds))
  .aggregate(new PersonTransactionAggregateFunction)
  .filter(new PersonTransactionFilterFunction(config.groupMaxSize))
  .flatMap(new PersonTransactionFlatMapFunction(config.gapStartSeconds * 1000))
  .addSink(new DiscardingSink[PersonTransactionsGroup])

val backend = new RocksDBStateBackend(config.checkpointsDirectory, true)
backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED)
env.setStateBackend(backend: StateBackend)
env.execute(getClass.getSimpleName)
Question:
Is my implementation incorrect for this use case ?
Is there something I have missed?
Is there some other optimization I can try ?
I have issues finding the bottleneck of this scenario, any tips there ?
Thank you
P.S. First time poster, be kind please.
Edit 1:
Input kafka topics are partitioned on the producer side with the same function used in the keyBy.
Partitions count is equal to or exactly 2 times greater than the parallelism of the flow.
Partitions per topic have a similar amount of events (deviation ~5-10%).
Topics are populated with different amounts of events (A has 10X more events than B, B has 1000X more events than C), but afaik this shouldn't be an issue.
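For readers who do not want to open the repository, the sub-grouping step from the implementation description (splitting one session's events wherever consecutive start times are further apart than a gap) can be sketched in plain Scala. This is only an illustration of the idea, not the code in PersonTransactionFlatMapFunction: the Event shape, the StartTimeSubGroups name, and the "gap is inclusive" choice are all assumptions.

```scala
object StartTimeSubGroups {
  // Hypothetical event shape; only the start time matters for this sketch.
  final case class Event(id: String, startMillis: Long)

  /** Sorts events by start time and starts a new sub-group whenever the
    * distance to the previous event's start time exceeds `gapMillis`.
    */
  def split(events: Seq[Event], gapMillis: Long): Vector[Vector[Event]] =
    events.sortBy(_.startMillis).foldLeft(Vector.empty[Vector[Event]]) { (groups, e) =>
      groups.lastOption match {
        // Close enough to the last event of the current sub-group: extend it.
        case Some(last) if e.startMillis - last.last.startMillis <= gapMillis =>
          groups.init :+ (last :+ e)
        // First event, or the gap was exceeded: open a new sub-group.
        case _ =>
          groups :+ Vector(e)
      }
    }
}
```

Note that the sort is over the whole window content, so its cost grows with the group size, which is one reason to filter oversized groups before this step.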

Related

How to scale a Flink Job that consumes a huge topic

The setup:
Flink version 1.12
Deployment on Yarn
Programming language: Scala
Flink job:
Two input kafka topics and one output kafka topic
Input1: is a huge topic between 300K and 500K messages per second. Each message has 600 fields.
Input2: is a small topic about 20K messages per second once per day. Each message has 22 fields.
The goal is to enrich Input1 with Input2 and the output is a kafka topic where every message has 100 fields from Input1 and 13 fields from Input2.
I keep a state from input2 as MapState
I use RichCoMapFunction to do the mapping
This is a snippet from the code where I connect both streams:
stream1.connect(stream2)
.keyBy(_.getKey1,_.getKey2)
.map(new RichCoMapFunction)
I use setAutoWatermarkInterval = 300000
No checkpoints or savepoints are currently used
Flink Configurations:
Number of partitions for Input1 = 120
Number of Partitions for Input2 = 30
Number of partitions for the output topic = 120
Total number of Parallelism = 700
Number of Parallelism for input1 = 120
Number of Parallelism for input2 = 30
Join Parallelism: 700 (the parallelism used to connect both streams). This is set as follows:
stream1.connect(stream2)
.keyBy(_.getKey1,_.getKey2)
.map(new RichCoMapFunction)
.setParallelism(700)
jobManagerMemoryFlinkSize:4096m
taskManagerMemoryFlinkSize:3072m
taskManagerMemoryManagedSize:1b
clusterEvenlySpreadOutSlots:true
akkaThroughput:1500
Yarn Configurations:
yarnSlots = 4
yarnjobManagerMemory = 5120m
yarntaskManagerMemory = 4096m
Total Number of Task Slots = 700
Number of Task Managers = 175
Problem:
The latency on the output topic is around 30min which is unacceptable for our use case.
I tried many other Flink configurations related to Memory allocations and vCores but it didn't help.
It would be great if you have any suggestions on how can we scale to reach higher throughput and lower latency.
EDIT1: The RichCoMapFunction code:
class Stream1WithStream2CoMapFunction extends RichCoMapFunction[Input1, Input2, Option[Output]] {
  private var input2State: MapState[Long, Input2] = _

  override def open(parameters: Configuration): Unit = {
    val ttlConfig = StateTtlConfig
      .newBuilder(org.apache.flink.api.common.time.Time.days(3))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .build()
    val mapStateDescriptor = new MapStateDescriptor[Long, Input2]("input2State", classOf[Long], classOf[Input2])
    mapStateDescriptor.enableTimeToLive(ttlConfig)
    input2State = getRuntimeContext.getMapState(mapStateDescriptor)
  }

  override def map1(value: Input1): Option[Output] = {
    // Create a new object of type Output (enrich input1 with fields from input2 from the state)
  }

  override def map2(value: Input2): Option[Output] = {
    // Put the value in the input2State
  }
}

You could use a profiler (or the flame graphs added to Flink 1.13) to try to diagnose why this is running slowly. The backpressure/busy monitoring added in Flink 1.13 would also be helpful.
But my guess is that tremendous effort is going into serde. If you aren't already doing so, you should eliminate all unnecessary fields from stream1 as early as possible in the pipeline, so that the data that won't be used never has to be serialized. For a first pass, you could do this in a map operator chained to the source (at the same parallelism as the source), but a custom serializer will ultimately yield better performance.
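As a rough illustration of dropping unused fields early, here is a plain Scala projection of the kind that could form the body of a map or flatMap chained directly to the source. The names (EarlyProjection, SlimInput1), the positional field layout, and the Long key are assumptions made for the sketch; the real Input1 type and the fields to keep are not shown in the question.

```scala
object EarlyProjection {
  // Hypothetical slim record: the join key plus only the fields the output needs.
  final case class SlimInput1(key: Long, kept: Vector[String])

  /** Keeps only the columns listed in `keepIdx`.
    * Returns None when an index is out of range or the key is not a Long,
    * so malformed records can be dropped by a flatMap.
    */
  def project(fields: IndexedSeq[String], keyIdx: Int, keepIdx: Seq[Int]): Option[SlimInput1] =
    if ((keepIdx :+ keyIdx).exists(i => i < 0 || i >= fields.length)) None
    else
      scala.util.Try(fields(keyIdx).toLong).toOption
        .map(k => SlimInput1(k, keepIdx.map(i => fields(i)).toVector))
}
```

Everything downstream of such a step (the keyBy shuffle, the join, the sink) then only serializes the slim record. It does not remove the cost of deserializing the wide record at the source, which is where a custom deserializer would help.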
You haven't mentioned the sink, but sinks are often a culprit in these situations. I assume it's Kafka (since you mentioned the output topic), and I assume you're not using Kafka transactions (since checkpointing is disabled). But how is the sink configured?
Why have you set the AutoWatermarkInterval to 300000 if your job isn't using watermarks? If you are using watermarks somewhere, this will add up to 5 minutes of latency. If you're not using watermarks, this setting is meaningless.
And why have you set akkaThroughput: 1500? This looks suspicious. I would experiment with resetting this to the default value (15).
Is there any other custom tuning, such as network buffering? I would call into question all non-default configuration settings (though I'm sure some are justified, like memory).
I would also set the parallelism for the whole job to a uniform value, e.g., 700. Fine-tuning individual stages of the pipeline is rarely helpful, and can be harmful.
How have you set maxParallelism? I would set it to something like 2800 or 3500 so that you have at least 4 or 5 key groups per slot.
Could it be that a few instances are doing most of the work? You can examine the metrics on the various sub-tasks of the RichCoMapFunction and look for skew. E.g., look at numRecordsInPerSecond.
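To make the maxParallelism suggestion concrete, the sketch below counts how many key groups each subtask would own. It assumes the range assignment I believe Flink uses (subtask index = keyGroup * parallelism / maxParallelism) and leaves out the hashing of keys into key groups entirely; KeyGroupSpread and its methods are hypothetical names, not Flink API.

```scala
object KeyGroupSpread {
  /** Subtask that owns a key group under a contiguous range assignment
    * (assumed formula: keyGroup * parallelism / maxParallelism).
    */
  def subtaskFor(keyGroup: Int, parallelism: Int, maxParallelism: Int): Int =
    (keyGroup.toLong * parallelism / maxParallelism).toInt

  /** Number of key groups owned by each subtask, indexed by subtask. */
  def keyGroupsPerSubtask(parallelism: Int, maxParallelism: Int): Vector[Int] = {
    val counts = Array.fill(parallelism)(0)
    (0 until maxParallelism).foreach { kg =>
      counts(subtaskFor(kg, parallelism, maxParallelism)) += 1
    }
    counts.toVector
  }
}
```

With parallelism 700 and maxParallelism 2800, every subtask owns exactly 4 key groups. With a value that is not a multiple of the parallelism, such as 2048, some subtasks own 2 key groups and others 3, so the busier ones see roughly 50% more keys even before any skew in the data itself.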

Scaling Kafka: how new event processing capacity is added dynamically?

To a large extent, getting throughput in a system on Kafka rests on these degrees of freedom:
(highly recommended) messages should be share-nothing. If they are share-nothing, they can be randomly assigned to different partitions within a topic and processed independently of other messages
(highly recommended) the partition count per topic should be sized appropriately. More partitions per topic means greater possible levels of parallelism
(highly recommended) to avoid hotspots within a topic partition, the Kafka key may need to include time or some other varying data point so that a single partition does not unintentionally get the majority of the work
(helpful) the processing time per message should be small when possible
https://dzone.com/articles/20-best-practices-for-working-with-apache-kafka-at mentions other items that fine-tune these principles
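The hotspot point in the list above can be illustrated with a small sketch. It uses String.hashCode modulo the partition count as a stand-in for Kafka's real partitioner (an assumption, the actual hash differs), and saltedKey is a hypothetical helper that appends a time bucket to the client id.

```scala
object PartitionSpread {
  /** Stand-in for a key-based partitioner: non-negative hash modulo partition count. */
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  /** Appends a rotating time bucket so one busy client is not pinned to one partition. */
  def saltedKey(clientId: String, timestampMillis: Long, bucketMillis: Long, buckets: Int): String =
    s"$clientId|${(timestampMillis / bucketMillis) % buckets}"
}
```

Keyed by the bare client id, every message of that client lands on the same partition; with the salted key the same client's messages spread over several partitions. The price is that per-client ordering is only preserved within a bucket, which matters for the order-sensitive case described further down.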
Now suppose that on an otherwise OK system, one will get a lot of new work. For example, a new and large client may be added mid-day or an existing client may need to onboard a new account adding zillions of new events. How do we scale horizontally adding new capacity for this work?
If the messages are truly share-nothing throughout the entire system --- I have a data pipeline of services where A gets a message, processes it, publishes a new message to another service B, and so on --- adding new capacity to the system could be as easy as sending a message on a separate administration topic telling the consumer task(s) to spin up new threads. Then, so long as the number of partitions in the topic(s) is not a bottleneck, one would indeed have added new processing capacity.
This approach is certainly doable but is still suboptimal in these respects:
Work on different clientIds is definitely share-nothing. Merely adding new threads gets work done faster, but any new work would interleave behind and within the existing client work. Had a new topic been available with a new pub/sub process pair(s), the new work could be done in parallel if the cluster has spare capacity on the new topic(s)
In general, share-nothing work may not always be possible at every step in a data pipeline. If ordering was ever required, the addition of new subscriber threads could get messages out of order for a given topic and partition. This happens when there are M partitions in a topic but >M subscriber threads. I have one such order-sensitive case. It's worth noting then that ordering effectively means at most 1 subscriber thread per partition, so sizing partitions may be even more important.
Tasks may not be allowed to add topics at runtime by the sysadmin
Even if adding topics at runtime is possible, system orchestration is required to tell the various producers that clientID is no longer associated with the old topic T, but rather T'. WIP on T should be flushed first before using T'
How does the Kafka community deal with adding capacity at runtime, or is this day-dreaming? Dynamic, elastic horizontal capacity seems to broadly center on these principles:
have spare capacity on your cluster
have extra unused topics for greater parallelism; create them at runtime, or pre-create them but leave them unused if sysadmins don't allow dynamic creation
equip the system so that events for a given clientID can be intercepted before they enter the pipeline and deferred to a special queue, know when existing events on the clientID have flushed through the system, then update config(s) sending the held/deferred events and any new events on new clients to the new topic
Telling consumers to spin up more listeners
Dynamically adding more partitions? (Doubt that's possible or practical)

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. e.g. say there are three data sources pushing their respective events into Kafka topics, but due to some reason, the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how we can make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say the last 100 entries) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window.

I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define Windows in KSQL:
hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations that you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;

Some suggestions:
Handle delay at the producer end:
Ensure all three producers send data to the Kafka topics at a similar cadence by using batch.size and linger.ms.
E.g., if linger.ms is set to 1000, a producer waits at most 1 second for a batch to fill before sending it.
Handle delay at the consumer end:
Any streaming engine on the consumer side (be it Kafka Streams, Spark Streaming, or Flink) provides window functionality to join/aggregate stream data based on keys while accounting for late data.
See the Flink windows documentation for guidance on choosing the right window type.

To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from data source with a batch Id (Guid) of some form. Consumers can then wait until the next batch id shows up marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently mis-aligned. There is no algorithm to detect this but you might have some fields in the data that show discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x-seconds and assume all sources succeed in this much time or look at some form of time stamps (logical or wall clock) to detect that a source has moved on to the next time window implicitly showing completion of the last window.
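The timestamp-based variant can be sketched as follows: events are bucketed into fixed windows by their own timestamps, and a window is only released once every source has produced an event in a later window. This is a minimal sketch under stated assumptions: each source emits in timestamp order, the Event shape and the WindowAlignment name are invented for the example, and a real consumer would work incrementally rather than over a full collection.

```scala
object WindowAlignment {
  // Hypothetical event shape for the sketch.
  final case class Event(source: String, timestampMillis: Long, payload: String)

  /** Start of the fixed-size window that contains `ts`. */
  def windowStart(ts: Long, sizeMillis: Long): Long =
    ts - Math.floorMod(ts, sizeMillis)

  /** Windows that every expected source has already moved past, keyed by window start.
    * Returns an empty map while any expected source has not reported at all.
    */
  def completeWindows(events: Seq[Event], sources: Set[String], sizeMillis: Long): Map[Long, Seq[Event]] = {
    val latestBySource = events.groupBy(_.source).map { case (s, es) => s -> es.map(_.timestampMillis).max }
    if (!sources.forall(latestBySource.contains)) Map.empty
    else {
      // The slowest source decides which windows are safe to hand to the model.
      val safeUpTo = windowStart(sources.map(latestBySource).min, sizeMillis)
      events
        .groupBy(e => windowStart(e.timestampMillis, sizeMillis))
        .filter { case (start, _) => start < safeUpTo }
    }
  }
}
```

An inference engine would then read only from completeWindows instead of "the last 100 entries" of each topic, trading some latency (it waits for the slowest source) for inputs that are guaranteed to share a window.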

The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time. E.g., the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay. Use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
At the inference engine side, expand your windows at a per-producer level to sync up events across producers.

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and the Kafka system are installed on a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more-or-less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutsPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    //This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
        .shuffle()
        .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
        .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));
    //Create a state Factory to produce outputs to kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
        .withProducerProperties(kafkaProperties)
        .withKafkaTopicSelector(new ODTopicSelector())
        .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks and/or by avoiding starvation while a batch finishes processing?

I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.
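To see how ordered state updates interact with maxSpoutPending, here is a toy timing model in Scala (chosen only to match the other examples on this page; it does not use any Storm API). It assumes a batch is emitted as soon as fewer than maxPending batches are uncommitted, and that a batch can commit only after it has finished and the previous batch has committed. OrderedCommitModel and its parameters are hypothetical.

```scala
object OrderedCommitModel {
  /** Returns (start times, commit times) per batch for the given processing durations.
    * Simplified model: batch i starts once batch i - maxPending has committed,
    * and commits are forced into batch order.
    */
  def simulate(durations: Vector[Long], maxPending: Int): (Vector[Long], Vector[Long]) = {
    require(maxPending >= 1, "maxPending must be at least 1")
    durations.indices.foldLeft((Vector.empty[Long], Vector.empty[Long])) {
      case ((starts, commits), i) =>
        // The first maxPending batches can start immediately.
        val start = if (i < maxPending) 0L else commits(i - maxPending)
        val finish = start + durations(i)
        // A finished batch still waits for its predecessor's commit.
        val commit = if (i == 0) finish else math.max(finish, commits(i - 1))
        (starts :+ start, commits :+ commit)
    }
  }
}
```

With durations of 20, 1, 1 and 1 minutes and maxPending = 3, the three first batches start at once, but batches 2 and 3 cannot commit before minute 20 and the fourth batch cannot even start until then. That matches the "one slow task holds everything up" behaviour described in the question.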

Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries, which grow throughout the application. The entire state is up to 3GB, which can be handled by each node in the cluster. In each batch, some data is added to a state but not deleted until the very end of the process, i.e. ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that the slowdown in these batches occurs at a certain point, exactly when all 20 partitions are "skipped". Or this is what the UI says.
My understanding of skipped is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the amount of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size, it just impacts the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?

Is this a bug in the mapWithState() functionality or is this intended behaviour?
This is intended behavior. The spikes you're seeing are because your data is getting checkpointed at the end of that given batch. If you look at the time of the longer batches, you'll see that it happens consistently every 100 seconds. That's because the checkpoint interval is constant and is calculated from your batchDuration (how often you read a batch from your data source) multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think about how you can minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure that the data is spread across enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization, as that can give you a meaningful performance boost.
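The arithmetic behind "every 10th batch" can be written out as a small, Spark-free sketch. It only mirrors the multiplier of 10 from the snippet quoted earlier in this answer and treats checkpoints as falling on exact multiples of the interval, which is a simplification; CheckpointCadence and its methods are hypothetical names, and explicitSeconds stands for an interval you would set yourself via DStream.checkpoint.

```scala
object CheckpointCadence {
  // Mirrors DEFAULT_CHECKPOINT_DURATION_MULTIPLIER from the quoted Spark source.
  val DefaultMultiplier = 10

  /** Checkpoint interval in seconds: the explicit value if given, else batch duration * 10. */
  def checkpointIntervalSeconds(batchSeconds: Long, explicitSeconds: Option[Long] = None): Long =
    explicitSeconds.getOrElse(batchSeconds * DefaultMultiplier)

  /** 1-based indices of the batches whose end time falls on a checkpoint boundary. */
  def checkpointBatches(batchSeconds: Long, totalBatches: Int, explicitSeconds: Option[Long] = None): Vector[Int] = {
    val interval = checkpointIntervalSeconds(batchSeconds, explicitSeconds)
    (1 to totalBatches).filter(i => (i * batchSeconds) % interval == 0).toVector
  }
}
```

For 10-second batches this gives an interval of 100 seconds and slow batches at positions 10, 20, 30, and so on. Passing an explicit interval of 50 seconds would move the spikes to every 5th batch: each checkpoint happens more often, but the amount of state written per checkpoint stays the same, so this changes where the cost shows up rather than removing it.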

In addition to the accepted answer, which points out the price of serialization related to checkpointing, there's another, lesser-known issue which might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price
If you process the stream of timed-out/ deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not as one might have expected, on each RDD)
