Spark Streaming Kafka Receivers API - numPartitions - scala

We are using spark-streaming-kafka-0-8 Receivers. We are not able to increase the number of consumed events by increasing numPartitions; it seems increasing numPartitions doesn't affect performance.
The KafkaUtils.createStream method takes a topic_name-to-numPartitions map, where each partition is supposed to be consumed in its own thread.
Currently we are working with:
KafkaUtils.createStream[Integer, Event, IntegerDecoder, EventDecoder](
  ssc,
  Configuration.kafkaConfig,
  scala.collection.immutable.Map(topic -> 1),
  StorageLevel.MEMORY_AND_DISK)
I would expect scala.collection.immutable.Map(topic -> 10) to pull many more events than a single thread does, but it doesn't improve performance (I made sure that 10 threads are in fact used per receiver).
However, if I create more Kafka receivers (which, from my understanding, is exactly equivalent to increasing threads), the performance does improve.
Is this a problem with version 0-8?
Should increasing numPartitions improve amount of consumed events?
Why does adding receivers improve performance while increasing numPartition doesn't?

Is this a problem with version 0-8?
No, it is a "problem" with the receiver based approach, which is what you're using with createStream. The said approach will create a single thread for consumption on a given executor node. If you want to read concurrently, you have to create multiple such receivers.
Per the documentation:
Topic partitions in Kafka does not correlate to partitions of RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in the KafkaUtils.createStream() only increases the number of threads using which topics that are consumed within a single receiver. It does not increase the parallelism of Spark in processing the data.
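A minimal sketch of that multiple-receiver approach, reusing the ssc, topic, Event, IntegerDecoder, EventDecoder and Configuration.kafkaConfig from the question (the receiver count is an assumption you would tune):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: several receivers, each with one consumer thread, unioned into a
// single DStream so downstream processing sees one stream.
val numReceivers = 5 // assumption: tune to the cores/executors you can spare
val receivers = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream[Integer, Event, IntegerDecoder, EventDecoder](
    ssc,
    Configuration.kafkaConfig,
    Map(topic -> 1), // one consumer thread per receiver
    StorageLevel.MEMORY_AND_DISK)
}
val unified = ssc.union(receivers) // process this DStream downstream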
If you want to increase concurrency, use the direct (receiverless) approach (KafkaUtils.createDirectStream), which dispatches each TopicPartition to a given executor node for consumption, thus allowing all executors to participate in consuming from Kafka.
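A rough sketch of the direct approach with the same 0-8 API, assuming String keys/values and a placeholder broker list:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: direct stream. Each Kafka TopicPartition becomes one Spark
// partition, consumed by the executors themselves rather than by a receiver.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  kafkaParams,
  Set(topic))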

Related

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting up a Kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest")?
An alternative is to set up HDFS storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control over the throughput in this case.
Throughput calculation:
Let's say 25 consumers run concurrently, each reading at 30 MB/s -> 750 MB/s.
Assuming data is read from disk, and the disk rate is 50 MB/s, I need to read concurrently from 750/50 = 15 disks.
Does that mean I need 15 brokers? I did not see how one broker can allocate partitions across several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set an HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
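As a rough sketch (the account, container and path are hypothetical, and the storage-account key still has to be configured for Hadoop):

import org.apache.spark.sql.SparkSession

// Sketch: batch-read the static dataset straight from Azure Blob Storage.
// Requires fs.azure.account.key.<account>.blob.core.windows.net to be set.
val spark = SparkSession.builder().appName("student-job").getOrCreate() // hypothetical app name
val dataset = spark.read
  .parquet("wasbs://dataset@myaccount.blob.core.windows.net/timeseries/") // hypothetical path
// each student's job can then process `dataset` with as many executors as needed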
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that.
It's not really clear whether all 50 students do the same processing on the data set or whether some pre-computation could be shared. If it could, and the data is all streamed through a topic, Kafka Streams KTables can be set up to aggregate static statistics of the data; that way you can distribute the load for those queries and avoid needing 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, TimescaleDB, or InfluxDB, maybe Druid, which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, as it can use more executor threads than there are partitions, so throughput is mostly a question of how fast your network and disks are.
Kafka keeps recently written data in memory (the page cache), so for your use case most reads will probably be served from memory.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
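For example, a rough sketch with Spark 3.x Structured Streaming (assuming an existing SparkSession named spark; broker and topic are placeholders):

// Sketch: Kafka source with minPartitions larger than the topic's partition
// count, so a single Kafka partition can be split across several Spark tasks.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "timeseries-topic")           // placeholder
  .option("startingOffsets", "earliest")
  .option("minPartitions", "200")                    // hint: roughly 200 Spark tasks
  .load()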

Increase or decrease Kafka partitions dynamically

I have a system where load is not constant. We may get 1000 requests a day or no requests at all.
We use Kafka to pass on the requests between services. We have kept an average number of Kafka consumers to reduce the cost incurred. Now my Kafka consumers will sit idle if no requests are received that day, and there will be lag if too many requests are received.
We want to keep these consumers in autoscale mode, so that the number of servers (Kafka consumers) increases if there is a spike in the number of requests. Once the number of requests drops, we will remove servers. Therefore, the Kafka partitions would have to be increased or decreased accordingly.
Kafka allows increasing the number of partitions. Given that, how can we decrease Kafka partitions dynamically?
Is there any other solution to handle this Auto-scaling?
Scaling up partitions will not fix your lag problems in the short term, since no data is moved between partitions when you do this; existing (or new) consumers are still stuck reading the data in the previous partitions.
It's not possible to decrease partitions, and it's not possible to scale consumers beyond the partition count.
If you are able to sacrifice processing order for consumption speed, you can separate the consuming threads from the worker threads, as hinted at in the KafkaConsumer javadoc; you would then be able to scale those worker threads independently (see the sketch after the quoted javadoc below).
Since you are thinking about modifying the partition counts, then I'm guessing processing order isn't a problem.
have one or more consumer threads that do all data consumption and hands off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing.
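A rough sketch of that pattern in Scala (broker, group id, topic and pool size are placeholders): one consuming thread feeds a bounded queue that a pool of workers drains.

import java.time.Duration
import java.util.{Collections, Properties}
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}

object DecoupledConsumer {

  // placeholder for the actual work; ordering across workers is not guaranteed
  def process(record: ConsumerRecord[String, String]): Unit = ()

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder
    props.put("group.id", "autoscaled-workers")    // placeholder
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // bounded hand-off queue between the consuming thread and the workers
    val queue = new LinkedBlockingQueue[ConsumerRecord[String, String]](10000)

    // worker pool: this is what you scale, independently of the partition count
    val workers = Executors.newFixedThreadPool(8)
    (1 to 8).foreach { _ =>
      workers.submit(new Runnable {
        override def run(): Unit = while (true) process(queue.take())
      })
    }

    // single consuming thread: polls Kafka and hands records off to the queue
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("requests")) // placeholder topic
    while (true) {
      consumer.poll(Duration.ofMillis(500)).forEach(r => queue.put(r))
    }
  }
}

Note that with the default auto-commit behaviour, offsets can be committed before a worker has finished a record, so this trades delivery guarantees for throughput.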
A single consumer can consume multiple partitions. Therefore partition your work for the largest anticipated parallel requirement, and then scale your number of consumers as required.
For example, if you think you need 32 parallel consumers, you would give your Kafka topic 32 partitions.
You can run 32 consumers (each gets one partition), or eight consumers (each gets four partitions) or just one (which gets all 32 partitions). Or any number of consumers in between. Kafka's protocol ensures that within the consumer group all partitions are consumed, and will rebalance as and when consumers are added or removed.

Kafka Streams: event-time skew when processing messages from different partitions

Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. Kafka Streams application does some transformations on these messages, then groups by some key, and then aggregates messages by an event-time window with the given grace period.
Each task could process incoming messages at a different speed (e.g., because running on servers with different performance characteristics). This means that after groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.
Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark when merging partitions.
Should this be a real concern, especially when processing large amounts of historical data, or am I missing something? Does Kafka Streams offer a solution to deal with this scenario?
UPDATE: My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle.
Consider this code snippet:
stream
  .mapValues(...)
  .groupBy(...)
  .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
  .aggregate(...)
I assume the mapValues() operation could be slow for some tasks for whatever reason, and because of that the tasks process messages at a different pace. When a shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).
My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.
Basically, a task/processor maintains a stream-time, which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purposes in Kafka Streams (e.g., Punctuator, Windowed Aggregation, etc.).
[Windowed Aggregation]
As you mentioned, the stream-time is used to determine if a record should be accepted, i.e. record_accepted = end_window_time(current record) + grace_period > observed stream_time.
As you described it, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline) this will create out-of-order messages. Unfortunately, I'm afraid that the only way to deal with that is to increase the grace_period.
This is actually the eternal trade-off between Availability and Consistency.
[Behaviour for KStream-KStream and KStream-KTable Joins]
When you are performing a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned Topics. For example, Task 0 will be assigned to TopicA-Partition0 and TopicB-Partition0.
The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.
Then, records are polled one by one from the queues and processed by the topology instance. The record returned by each poll is the one with the lowest timestamp across the non-empty queues.
In addition, if a queue is empty, the task may become idle for a period of time so that no more records are polled. The maximum amount of time a Task will stay idle can be configured with the stream config max.task.idle.ms.
This mechanism allows synchronizing co-localized partitions. By default, max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition, which may lead to records being filtered because the stream-time potentially increases more quickly.
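For reference, a rough sketch of raising that setting (application id and broker are placeholders; as noted above, it only helps synchronize the partitions read by a single task, e.g. for joins):

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

// Sketch: let a task wait up to 5 seconds for data on an empty input
// partition before advancing stream-time (value is illustrative).
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-time-aggregation") // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")        // placeholder
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, "5000")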

KafkaIO uneven partition consumption after a while

I have a simple Dataflow pipeline (job id 2018-05-15_06_17_40-8591349846083543299) with a minimum of 1 worker and a maximum of 7 workers that does the following:
Consume from 4 Kafka topics using KafkaIO. Each topic is represented differently and is a separate PCollection
Perform transformation on each PCollection to output a standard representation PCollection.
Merge the 4 PCollections using Flatten.pCollections
Window into hourly windows with the following trigger:
Repeatedly
  .forever(
    AfterFirst.of(
      AfterPane.elementCountAtLeast(40000),
      AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
  .orFinally(AfterWatermark.pastEndOfWindow())
Write these events to GCS using AvroIO windowed writes with 14 shards.
When the pipeline is launched, everything is fine initially, but several hours later the System Lag increases dramatically in the AvroIO:GroupIntoShards step.
Upon further investigation, one of the topics is lagging behind by many hours (this topic has the greatest events-per-second rate compared to the other 3). Looking at the logs I see Closing idle reader for S12-000000000000000a, which is understandable. However, the topic's consumer group offsets for the 36 partitions are in a state where for some partitions the offset is very low while for others it is very high. The log-end-offset is more or less evenly distributed, and the records we are producing are around the same size.
Questions:
If the System Lag is high in a certain step, does that prevent the Kafka consumers from consuming?
Any possible reason for the uneven distribution in Kafka offsets?
The PCollections that are merged have different traffic patterns, some low and some high. Would adding the AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)) trigger effectively start writing to GCS for each (window, shard) 5 minutes after an event is first seen in a window?
Updating the pipeline using the same code / configuration brings it back into a normal state where the consumed rate is much higher (due to the lag before the restart) than the produced rate.
Addressing the 3 questions raised (I left a comment about the specific job):
No, system lag does not prevent Kafka from consuming.
In general if there is lots of work for downstream stages ready to be processed, that can delay upstream work from starting. But that is not KafkaIO specific.
Does not seem to be the case here. In general, assuming there is no skew among the Kafka partitions themselves, heavy skew in Beam processing can cause readers to be assigned to workers that are doing more work than others.
I think yes. I think firstElementInPane() applies to elements from any of the sources.

Is there a way to further parallelize kstreams aside from partitions?

I understand that the fundamental approach to parallelization with kafka is to utilize partitioning. However, I have a special situation in that I have to leverage an existing infrastructure that only has 6 partitions, and I need to process millions and millions of records per second.
Is there a way to further optimize so that multiple KStream consumers could read from a single partition at the same time and distribute the load equally?
The simplest way is to create a "helper" topic with the desired number of partitions. This topic can be configured with a very short retention time, because the original data is safely stored in the actual input topic. You use this helper topic to route all data through it and thus allow for more parallelism downstream:
builder.stream("input-topic")
.through("helper-topic-with-many-partitions")
... // actual processing
Partitions are the unit of parallelization. With 6 partitions you can have at most 6 instances (of KStream) consuming data. If each instance runs on a separate machine with, say, a 1 Gbps network link (roughly 100 MB/s), you could be reading about 600 MB/s in total.
If that's not enough, you'd need to repartition the data.
Now, to distribute your processing, you would run each KStream instance (with the same consumer group, i.e. the same application.id) on a different machine, as in the sketch below.
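A rough sketch of such an instance (names and values are placeholders; every machine runs the same code):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Sketch: every machine runs the same application with the same application.id
// (which doubles as the consumer group), so Kafka divides the helper topic's
// partitions among all running instances.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "high-throughput-app") // same id on every machine
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")     // placeholder
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "2")               // threads per instance
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

val builder = new StreamsBuilder()
val source = builder.stream[String, String]("helper-topic-with-many-partitions")
// ... apply the actual processing to `source`
val streams = new KafkaStreams(builder.build(), props)
streams.start()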
Here's a short video that demonstrates how Kafka Streams (via KSQL) is parallelized across 5 processes: https://www.youtube.com/watch?v=denwxORF3pU
It all depends on partitions and executors. With 6 partitions, I can usually achieve 500K+ messages/second, depending of course on the complexity of the processing.