How to scale a Flink job that consumes a huge topic - Scala

The setup:
Flink version 1.12
Deployment on Yarn
Programming language: Scala
Flink job:
Two input Kafka topics and one output Kafka topic
Input1: a huge topic with between 300K and 500K messages per second; each message has 600 fields.
Input2: a small topic with about 20K messages per second, arriving once per day; each message has 22 fields.
The goal is to enrich Input1 with Input2; the output is a Kafka topic where every message has 100 fields from Input1 and 13 fields from Input2.
I keep state from Input2 in a MapState.
I use a RichCoMapFunction to do the mapping.
This is a snippet from the code where I connect both streams:
stream1.connect(stream2)
  .keyBy(_.getKey1, _.getKey2)
  .map(new Stream1WithStream2CoMapFunction)
I use setAutoWatermarkInterval = 300000
No checkpoints or savepoints are currently used.
Flink Configurations:
Number of partitions for Input1 = 120
Number of Partitions for Input2 = 30
Number of partitions for the output topic = 120
Total parallelism = 700
Parallelism for Input1 = 120
Parallelism for Input2 = 30
Join parallelism = 700 (the parallelism used to connect both streams). It is set as follows:
stream1.connect(stream2)
  .keyBy(_.getKey1, _.getKey2)
  .map(new Stream1WithStream2CoMapFunction)
  .setParallelism(700)
jobManagerMemoryFlinkSize:4096m
taskManagerMemoryFlinkSize:3072m
taskManagerMemoryManagedSize:1b
clusterEvenlySpreadOutSlots:true
akkaThroughput:1500
Yarn Configurations:
yarnSlots = 4
yarnjobManagerMemory = 5120m
yarntaskManagerMemory = 4096m
Total Number of Task Slots = 700
Number of Task Managers = 175
Problem:
The latency on the output topic is around 30 minutes, which is unacceptable for our use case.
I have tried many other Flink configurations related to memory allocation and vCores, but they didn't help.
Any suggestions on how we can scale to reach higher throughput and lower latency would be greatly appreciated.
EDIT1: The RichCoMapFunction code:
class Stream1WithStream2CoMapFunction extends RichCoMapFunction[Input1, Input2, Option[Output]] {

  private var input2State: MapState[Long, Input2] = _

  override def open(parameters: Configuration): Unit = {
    val ttlConfig = StateTtlConfig
      .newBuilder(org.apache.flink.api.common.time.Time.days(3))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .build()

    val mapStateDescriptor = new MapStateDescriptor[Long, Input2]("input2State", classOf[Long], classOf[Input2])
    mapStateDescriptor.enableTimeToLive(ttlConfig)
    input2State = getRuntimeContext.getMapState(mapStateDescriptor)
  }

  override def map1(value: Input1): Option[Output] = {
    // Create a new object of type Output (enrich Input1 with fields from Input2 taken from the state)
  }

  override def map2(value: Input2): Option[Output] = {
    // Put the value into input2State
  }
}
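For clarity, a hedged sketch of what the two commented bodies might look like; the key accessors and the Output construction below are assumptions for illustration, not the actual code:

override def map1(value: Input1): Option[Output] = {
  // Look up the Input2 record previously stored under this key and, if present,
  // emit an enriched Output; otherwise emit nothing.
  Option(input2State.get(value.getKey1)).map(i2 => Output(value, i2))
}

override def map2(value: Input2): Option[Output] = {
  // Only update the enrichment state; nothing is emitted downstream for Input2 records.
  input2State.put(value.getKey2, value)
  None
}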

You could use a profiler (or the flame graphs added in Flink 1.13) to try to diagnose why this is running slowly. The backpressure/busy monitoring added in Flink 1.13 would also be helpful.
But my guess is that tremendous effort is going into serde. If you aren't already doing so, you should eliminate all unnecessary fields from stream1 as early as possible in the pipeline, so that the data that won't be used never has to be serialized. For a first pass, you could do this in a map operator chained to the source (at the same parallelism as the source), but a custom serializer will ultimately yield better performance.
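If it helps to make that concrete, here is a minimal sketch of such a first-pass projection; Input1Slim and the getField* accessors are made-up names, not the actual schema:

case class Input1Slim(key1: Long, field1: String, field2: Double) // only the fields the join and the output actually need

val slimStream1: DataStream[Input1Slim] = stream1
  .map(r => Input1Slim(r.getKey1, r.getField1, r.getField2))
  .setParallelism(120) // same parallelism as the Input1 source, so the map chains to it and the dropped fields never cross the network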
You haven't mentioned the sink, but sinks are often a culprit in these situations. I assume it's Kafka (since you mentioned the output topic), and I assume you're not using Kafka transactions (since checkpointing is disabled). But how is the sink configured?
Why have you set the AutoWatermarkInterval to 300000 if your job isn't using watermarks? If you are using watermarks somewhere, this will add up to 5 minutes of latency. If you're not using watermarks, this setting is meaningless.
And why have you set akkaThroughput: 1500? This looks suspicious. I would experiment with resetting this to the default value (15).
Is there any other custom tuning, such as network buffering? I would call into question all non-default configuration settings (though I'm sure some are justified, like memory).
I would also set the parallelism for the whole job to a uniform value, e.g., 700. Fine-tuning individual stages of the pipeline is rarely helpful, and can be harmful.
How have you set maxParallelism? I would set it to something like 2800 or 3500 so that you have at least 4 or 5 key groups per slot.
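For reference, a minimal sketch of those two environment-level settings (the values are just the examples discussed here):

env.setParallelism(700)     // uniform parallelism for the whole job
env.setMaxParallelism(2800) // 2800 / 700 slots = 4 key groups per slot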
Could it be that a few instances are doing most of the work? You can examine the metrics on the various sub-tasks of the RichCoMapFunction and look for skew. E.g., look at numRecordsInPerSecond.

Related

Flink session window scaling issues on YARN, Kafka

Job use case:
Create groups of events (transactions) that relate to each other, based on event members and event-start-time, from 3 streams, not relying on processing time.
We have an input throughput of around 20K events/sec.
All events of a transaction are sent to multiple Kafka topics (the job's sources) at the end of the transaction, and may arrive up to minutes late.
We want to create event groups that have start-time gaps of less than a few seconds and are identified by a business key (e.g. country and transaction type).
Transactions can be seconds to hours long, but all events forming a group of transactions should arrive within ~5 minutes after the end of a transaction.
Implementation:
The job consumes 3 data sources, assigns timestamps & watermarks, maps events to a common interface, unions the streams, keys the unioned stream, creates a session window by transaction end time with an allowed lateness, filters out groups with huge member counts (for business reasons), creates sub-groups (based on start time) within the groups from the session window, and finally sends the results to another topic.
Issue:
When running the job on a YARN cluster with Kafka connectors as I/O, the job is slow and does not scale well (~3K events/sec at parallelism 1, up to only ~5K events/sec at parallelism 10).
What I tried:
Playing around with scaling TMs, slots, and memory; at most we had around 10 TMs with 2 slots used, each with 32 GB memory (while maintaining the same parallelism on the Kafka partitions for both input and output)
Playing around with taskmanager.network.memory.fraction, containerized.heap-cutoff-ratio, taskmanager.memory.fraction
Playing around with kafka.producer.batch.size, kafka.producer.max.request.size
Setting an fs and then a RocksDB state backend, and setting taskmanager.memory.off-heap to true and taskmanager.memory.preallocate to true
Turning off checkpointing
Upgrading Flink to 1.9.2, the kafka-client to the latest version we use, the Kafka servers, Confluent, and Cloudera ...
Code:
I recreated a simplified use-case, the code is here: https://github.com/vladimirtomecko/flink-playground and the job looks basically like this:
implicit val (params, config) = PersonTransactionCorrelationConfig.parse(args)

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val aTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val bTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

val cTransactions: DataStream[PersonTransactionUniversal] =
  env
    .addSource(???)
    .assignTimestampsAndWatermarks(???)
    .flatMap(_.toUniversal: Option[PersonTransactionUniversal])

aTransactions
  .union(bTransactions)
  .union(cTransactions)
  .keyBy(new PersonTransactionKeySelector)
  .window(EventTimeSessionWindows.withGap(Time.seconds(config.gapEndSeconds)))
  .allowedLateness(Time.seconds(config.latenessSeconds))
  .aggregate(new PersonTransactionAggregateFunction)
  .filter(new PersonTransactionFilterFunction(config.groupMaxSize))
  .flatMap(new PersonTransactionFlatMapFunction(config.gapStartSeconds * 1000))
  .addSink(new DiscardingSink[PersonTransactionsGroup])

val backend = new RocksDBStateBackend(config.checkpointsDirectory, true)
backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED)
env.setStateBackend(backend: StateBackend)

env.execute(getClass.getSimpleName)
Question:
Is my implementation incorrect for this use case?
Is there something I have missed?
Is there some other optimization I can try?
I'm having trouble finding the bottleneck in this scenario; any tips there?
Thank you
P.S. First time poster, be kind please.
Edit 1:
The input Kafka topics are partitioned on the producer side with the same function used in the keyBy.
The partition count is equal to, or exactly 2 times, the parallelism of the flow.
Partitions per topic hold a similar number of events (deviation ~5-10%).
The topics are populated with different amounts of events (A has 10x more events than B, B has 1000x more events than C), but as far as I know this shouldn't be an issue.

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology, and we cannot manage to get a good, steady processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in terms of number of messages, about 2000 per day; however, each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and the Kafka system are installed on a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka is set large enough to include about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), with a max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers; some workers take 1 task, others take 2 and process them concurrently.
As tasks are finished, the remaining tasks start.
A starvation problem appears when only 1-2 tasks remain to be processed: the other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and a parallelism hint of 10, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if only 1 task remains.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried increasing the "maxSpoutPending" parameter, expecting the topology to read more batches and queue them at the same time, but it seems they are pipelined internally rather than processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // BrokerHosts is the marker interface.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));

    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);

    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));

    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());

    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));

    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream so that the UI shows you what bolts correspond to what sections.

Trident packs operations into as few bolts as possible. In addition, it never repartitions your stream unless you've done an operation that explicitly involves a repartitioning (e.g. shuffle, groupBy, partitionBy, global aggregation, etc). This property of Trident ensures that you can control the ordering/semi-ordering of how things are processed. So in this case, everything before the groupBy has to have the same parallelism or else Trident would have to repartition the stream. And since you didn't say you wanted the stream repartitioned, it can't do that. You can get a different parallelism for the spout vs. the each's following by introducing a repartitioning operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batches 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block the emission of new ones; even if batches 2 and 3 are fully processed, they have to wait for batch 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

Active batches piling up with Spark Streaming with Kafka

I have developed a Spark Streaming (1.6.2) job with Kafka in the receiver model and am running it with a 15-second batch interval.
The very first batch gets a lot of events and processes records very slowly. Suddenly the job fails and restarts again. Please see the screenshot below.
It is processing records, but too slowly to finish all the batches on time, and I don't want to see this queue piling up.
How can we control the input size to around 15K to 20K events? I tried enabling spark.streaming.backpressure.enabled but don't see any improvement.
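For reference, a sketch of how these rate-limit settings are typically applied with the receiver-based API; this is not the actual job code, spark.streaming.receiver.maxRate is an additional per-receiver cap not mentioned above, and the numbers are only an example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Explicit per-receiver cap: with 5 receivers and a 15-second batch,
  // 5 * 250 events/sec * 15 sec ≈ 18,750 events per batch.
  .set("spark.streaming.receiver.maxRate", "250")
val ssc = new StreamingContext(sparkConf, Seconds(15))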
I also implemented level of parallelism in data receiving, as below, but still didn't see any change in the input size.
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
I am using 6 executors and 20 cores.
Overview of my code:
I am reading logs from Kafka, processing them, and storing them in Elasticsearch every 15-second batch interval.
Could you please let me know how I can control the input size and improve the performance of the job, or how we can make sure that the batches are not piling up?

Why does word2vec only take one task for mapPartitionsWithIndex at Word2Vec.scala:323

I am running Word2Vec in Spark, and when it comes to fit(), only one task is observed in the UI for the mapPartitionsWithIndex stage.
As per the configuration, num-executors = 1000 and executor-cores = 2, and the RDD is coalesced to 2000 partitions. The mapPartitionsWithIndex step takes quite a long time. Can it be distributed across multiple executors or tasks?
setNumPartitions(numPartitions: Int) solved my problem; I had not checked the default value. The setter's documentation says:
Sets number of partitions (default: 1).
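A minimal sketch of how that setter is used; tokenizedRdd is an assumed RDD[Seq[String]] of sentences, and the partition count is only an example:

import org.apache.spark.mllib.feature.Word2Vec

val model = new Word2Vec()
  .setNumPartitions(16) // default is 1; more partitions speed up fit() but can cost some accuracy
  .fit(tokenizedRdd)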

Kafka spout bad performance

I'm testing a simple topology to check Kafka spout performance.
It contains a Kafka spout and a bolt that acknowledges each tuple.
Bolt execute method:
public void execute(Tuple input) {
    collector.ack(input);
}
Topology looks like this:
protected void configureTopology(TopologyBuilder topologyBuilder) {
    configureKafkaCDRSpout(topologyBuilder);
    configureKafkaSpoutBandwidthTesterBolt(topologyBuilder);
}

private void configureKafkaCDRSpout(TopologyBuilder builder) {
    KafkaSpout kafkaSpout = new KafkaSpout(createKafkaCDRSpoutConfig());
    int spoutCount = Integer.valueOf(topologyConfig.getProperty("kafka.cboss.cdr.spout.thread.count"));
    builder.setSpout(KAFKA_CDR_SPOUT_ID, kafkaSpout, spoutCount)
           .setNumTasks(Integer.valueOf(topologyConfig.getProperty(KAFKA_CDR_SPOUT_NUM_TASKS)));
}

private SpoutConfig createKafkaCDRSpoutConfig() {
    BrokerHosts hosts = new ZkHosts(topologyConfig.getProperty("kafka.zookeeper.broker.host"));
    String topic = topologyConfig.getProperty("kafka.cboss.cdr.topic");
    String zkRoot = topologyConfig.getProperty("kafka.cboss.cdr.zkRoot");
    String consumerGroupId = topologyConfig.getProperty("kafka.cboss.cdr.consumerId");

    SpoutConfig kafkaSpoutConfig = new SpoutConfig(hosts, topic, zkRoot, consumerGroupId);
    kafkaSpoutConfig.scheme = new SchemeAsMultiScheme(new CbossCdrScheme());
    kafkaSpoutConfig.ignoreZkOffsets = true;
    kafkaSpoutConfig.fetchSizeBytes = Integer.valueOf(topologyConfig.getProperty("kafka.fetchSizeBytes"));
    kafkaSpoutConfig.bufferSizeBytes = Integer.valueOf(topologyConfig.getProperty("kafka.bufferSizeBytes"));
    return kafkaSpoutConfig;
}

public void configureKafkaSpoutBandwidthTesterBolt(TopologyBuilder topologyBuilder) {
    SimpleAckerBolt b = new SimpleAckerBolt();
    topologyBuilder.setBolt(SPOUT_BANDWIDTH_TESTER_BOLT_ID, b, Integer.valueOf(topologyConfig.getProperty(CFG_SIMPLE_ACKER_BOLT_PARALLELISM)))
                   .setNumTasks(Integer.valueOf(topologyConfig.getProperty(SPOUT_BANDWIDTH_TESTER_BOLT_NUM_TASKS)))
                   .localOrShuffleGrouping(KAFKA_CDR_SPOUT_ID);
}
Other topology settings:
topology.max.spout.pending=250
topology.executor.receive.buffer.size=1024
topology.executor.send.buffer.size=1024
topology.receiver.buffer.size=8
topology.transfer.buffer.size=1024
topology.acker.executors=1
I'm launching my topology with 1 worker, 1 Kafka spout, and 1 SimpleAckerBolt.
This is what I get in the Storm UI:
OK, I got 1.5M tuples in 10 minutes. Bolt capacity is around 0.5. So my logic is simple: if I double the spout and bolt parallelism hints, I will get double the performance.
The next test was with 1 worker, 2 Kafka spouts, 2 SimpleAckerBolts, and topology.acker.executors=2. Here are the results:
So, I get worse performance with an increased parallelism hint. Why could this happen? How can I increase the number of tuples processed per second? Actually, any test with a spout parallelism hint greater than 2 shows a worse result than 1 spout executor.
I've already checked:
1) It's not Kafka's fault. The topic has 20 partitions on 2 brokers, and the topology on 4 workers scales and gets 4x performance.
2) It's not the server's fault. The server has 40 cores and 32 GB RAM; while running the topology it consumes around 1/8 of the CPU and almost no RAM.
3) Changing the topology.max.spout.pending parameter doesn't help.
4) Increasing the bolt or acker parallelism hint even more doesn't help.
So it seems as though you're hitting a limit on the performance of one worker. You're just giving the one worker too much to do, and it can't handle it all.
At this point, if you want to further increase the performance of your system, you have two options.
Add more workers.
Improve your one worker's ability to perform work.
If you do not want to add more workers, then what is left is to configure your one worker. You should investigate the configuration of that worker to give it more memory, more CPU, etc. You should probably look over the default configuration options for Storm and see whether tweaking some configuration values gives you better performance. Some of the configs that seem more likely to help than others:
worker.heap.memory.mb:
worker.childopts:
supervisor.childopts:
supervisor.memory.capacity.mb:
supervisor.cpu.capacity: