Spark Streaming Kafka direct consumer consumption speed drop - scala

The Kafka direct consumer started to limit reads to 450 events (5 * 90 partitions) per 5-second batch; it had been running fine for a day or two before that (about 5,000 to 40,000 events per batch).
I'm using a Spark standalone cluster (Spark and spark-streaming-kafka version 1.6.1) running in AWS, with an S3 bucket as the checkpoint directory: StreamingContext.getOrCreate(config.sparkConfig.checkpointDir, createStreamingContext). There are no scheduling delays and there is enough disk space on each worker node.
I didn't change any Kafka client initialization parameters, and I'm pretty sure Kafka's structure hasn't changed:
val kafkaParams = Map("metadata.broker.list" -> kafkaConfig.broker)
val topics = Set(kafkaConfig.topic)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
Also, I can't understand why, when the direct consumer description says "the consumed offsets are tracked by the stream itself", I still need to use a checkpoint directory when creating the streaming context?

This is usually the result of enabling backpressure by setting spark.streaming.backpressure.enabled to true. Usually, when the backpressure algorithm sees that there's more data coming in than it's used to, it starts capping each batch to a rather small size until it can "stabilize" itself again. This sometimes has false positives and slows down your stream's processing rate.
If you want to tweak the heuristic a little, there are some undocumented flags it is using (just make sure you know what you're doing):
val proportional = conf.getDouble("spark.streaming.backpressure.pid.proportional", 1.0)
val integral = conf.getDouble("spark.streaming.backpressure.pid.integral", 0.2)
val derived = conf.getDouble("spark.streaming.backpressure.pid.derived", 0.0)
val minRate = conf.getDouble("spark.streaming.backpressure.pid.minRate", 100)
If you want the gory details, PIDRateEstimator is what you're looking for.
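For example, here is a minimal sketch (assuming Spark 1.6.x; the app name, batch interval, and the minRate value of 1000 are placeholders, not recommendations) of wiring these settings into the SparkConf used to build the StreamingContext:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Keep backpressure on, but raise the floor the PID estimator may throttle
// down to, so a "false positive" cannot starve the batches as badly.
val conf = new SparkConf()
  .setAppName("kafka-direct-consumer")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.pid.minRate", "1000") // default is 100
// .set("spark.streaming.backpressure.enabled", "false")   // or disable it entirely to confirm it is the cause

val ssc = new StreamingContext(conf, Seconds(5))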

Related

Spark Kafka Streaming - There is a lot of delay in processing the batch

I am running Spark Streaming with Kafka for a word count program, and there is a lot of delay in batch creation and processing - around 2 minutes for each batch.
How could I reduce this time? Are there any properties to configure to make this as quick as possible, like properties at the Spark Streaming level or the Kafka level?
You should define the interval between batches in your (unstructured) StreamingContext, for example:
val ssc = new StreamingContext(new SparkConf(), Minutes(1))
In Structured Streaming you have an option, kafkaConsumer.pollTimeoutMs, with 512 ms as the default value; more information: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
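For the Structured Streaming case, a rough sketch of where that option would go (the broker address and topic name are placeholders; the option name comes from the integration guide linked above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-structured").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "myTopic")
  .option("kafkaConsumer.pollTimeoutMs", "512") // the default, shown only for illustration
  .load()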
Another problem can come from Kafka lag. Your application can take a long time to process a specific offset, maybe 2 minutes, so only once that offset is finished will it poll for others to process. Try comparing the current offset of your consumer group with the last offset of your topic.

spark-streaming-kafka-0-10: How to limit number of Spark partitions

Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?
Please excuse my rough understanding of these technologies; I think I'm still new to Spark and Kafka. The architecture and settings are mostly just messing around to explore and see how these technologies work together.
I have four virtual hosts, one with a Spark master and each with a Spark worker. One of the hosts is also running a Kafka broker, based on Spotify's Docker image. Each host has four cores and about 8 GB of unused RAM.
The Kafka broker has 206 topics, and each topic has 10 partitions. So there are a total of 2,060 partitions for applications to read from.
I'm using the spark-streaming-kafka-0-10 library (currently experimental) to subscribe to topics in Kafka from a Spark Streaming job. I am using the SubscribePattern class to subscribe to all 206 topics from Spark:
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("(pid\\.)\\d+"), kafkaParams)
)
When I submit this job to the Spark master, it looks like 16 executors are started, one for each core in the cluster. It also looks like each Kafka partition gets its own task, for a total of 2,060 tasks. I think my cluster of 16 executors is having trouble churning through so many tasks because the job keeps failing at different points between 1500 and 1800 tasks completed.
I found a tutorial by Michael Noll from 2014 which addresses using the spark-streaming-kafka-0-8 library to control the number of consumer threads for each topic:
val kafkaParams: Map[String, String] = Map("group.id" -> "terran", ...)
val consumerThreadsPerInputDstream = 3
val topics = Map("zerg.hydra" -> consumerThreadsPerInputDstream)
val stream = KafkaUtils.createStream(ssc, kafkaParams, topics, ...)
Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?
You could alter the number of generated partitions by calling repartition on the stream, but then you lose the 1:1 correspondence between Kafka and RDD partitions.
The number of tasks generated by the Kafka partitions isn't related to the fact that you have 16 executors. The number of executors depends on your settings and the resource manager you're using.
There is a 1:1 mapping between Kafka partitions and RDD partitions with the direct streaming API; each executor will get a subset of these partitions to consume from Kafka and process, where each partition is independent and can be computed on its own. This is unlike the receiver-based API, which creates a single receiver on an arbitrary executor and consumes the data itself via threads on the node.
If you have 206 topics with 10 partitions each, you'd better have a decent-sized cluster which can handle the load of the generated tasks. You can control the max messages generated per partition, but you can't alter the number of partitions unless you're willing to pay the shuffling cost of the repartition transformation.
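To make those two knobs concrete, here is a rough sketch building on the stream above (the rate of 1000 records/second/partition and the target of 64 Spark partitions are arbitrary placeholder values):
// Cap how many records each Kafka partition contributes per second of batch time...
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records per second per partition

// ...and, if you accept the shuffle, collapse the 2,060 Kafka partitions
// into fewer Spark partitions per batch.
val coalesced = stream.map(record => record.value).repartition(64)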
The second approach will be the best for your requirements. You only have to set consumerThreadsPerInputDstream = 1, so only one thread will be created per read operation; hence a single machine will be involved per cluster.

Sending to Kafka from a spark job taking too much time

I have a Spark Streaming job which consumes data from Kafka and sends it back to Kafka after doing some processing on each record.
For this I am doing some map operations on the data:
val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicNameMap, StorageLevel.MEMORY_AND_DISK)
var ad = ""
val abc = lines.map(_._2).map { x =>
  val jsonObj = new JSONObject(x)
  val data = someMethod(schema, jsonObj)
  data
}
Then I am doing a foreach operation on it. I am not collecting all the data to the driver here, since I want to send those records from inside the executors themselves.
abc.foreachRDD(rdd => {
  rdd.foreach { toSend =>
    val producer = KafkaProducerUtils.getKafkaProducer(kafkaBrokers)
    println("toSend---->" + toSend)
    producer.send(new ProducerRecord[String, String](topicToSend, toSend))
  }
})
I tried this code with 1405 records over a 10-second period, but it took approximately 2.5 minutes to complete the job. I know creating a KafkaProducer is costly; is there any other way to reduce the processing time? For my testing purposes I am using 2 executors with 2 cores and 1 GB each.
After searching a lot I found this article about KafkaSink. It will give you an idea of how to produce data to Kafka inside Spark Streaming in an efficient way.
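For context, here is a rough sketch of that KafkaSink idea (the class and object names are just illustrative, not an existing API): the producer is created lazily, once per executor JVM, via a broadcast variable, instead of once per record as in the question's foreach.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  // Lazily instantiated on the executor the first time send() is called.
  lazy val producer = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(brokers: String): KafkaSink = {
    val createProducer = () => {
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      new KafkaProducer[String, String](props)
    }
    new KafkaSink(createProducer)
  }
}

// Usage inside the streaming job: one broadcast sink, reused by every task.
val kafkaSink = ssc.sparkContext.broadcast(KafkaSink(kafkaBrokers))
abc.foreachRDD { rdd =>
  rdd.foreach { toSend =>
    kafkaSink.value.send(topicToSend, toSend)
  }
}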
There could be several reasons for this enormous delay in processing this amount of messages:
The problem could reside in your consume phase. If you use createStream, at least in minor versions of Spark, the high-level consumer implementation is used, which needs Zookeeper to store the offsets of the consumers belonging to a specific group. So I'd check this communication, because it could take too much time in the commit phase. If for some reason it commits each offset one by one, your consume rate could be impaired. So first of all, check this.
Another reason is the write-ahead log to the file system. Although your configuration indicates memory and disk, as you can see in the Spark documentation:
Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka
For better consume rates I would use createDirectStream instead.
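For reference, here is a sketch of what the direct-stream equivalent of the createStream call above could look like (the broker list and topic name are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-less direct stream: no receiver, no write-ahead log; offsets are
// tracked by the stream itself.
val kafkaParams = Map("metadata.broker.list" -> kafkaBrokers)
val topics = Set(topicToConsume)
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).map(_._2)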

Using Kafka to communicate between long running Spark jobs

I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka to be the broker in between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
                      kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.collect().foreach { msg =>
      // Now do some work as soon as we receive a message from the topic
    }
  })
  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) is received, it is sent to any available Worker Nodes to actually be executed upon
So first, if anything that I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, then I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. Its job is to create the graph of execution for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which abstraction you're using for Kafka: either the Receiver approach via KafkaUtils.createStream, or the Receiver-less approach via KafkaUtils.createDirectStream.
In both approaches, in general, data is consumed from Kafka and then dispatched to each Spark worker to process in parallel.
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism by either increasing the number of partitions in Kafka, or by repartitioning the data inside Spark (using DStream.repartition). I suggest testing this setup to determine if it suits your performance requirements.
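A minimal sketch of that second option, building on the consumerHandler code above (the target of 4 partitions is an arbitrary example):
// Repartitioning introduces a shuffle, but decouples the number of Spark
// partitions (and hence tasks) from the number of Kafka partitions.
val repartitioned = createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").repartition(4)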

Serial consumption of Kafka topics from Spark

Given the following code:
def createKafkaStream(ssc: StreamingContext,
                      kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.foreach { msg =>
      // Now do some DataFrame-intensive work.
      // As I understand things, DataFrame ops must be run
      // on Workers as well as streaming consumers.
    }
  })
  ssc
}

StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
My understanding is that Spark and Kafka will automagically work together to figure out how many consumer threads to deploy to available Worker Nodes, which likely results in parallel processing of messages off a Kafka topic.
But what if I don't want multiple, parallel consumers? What if I want 1-and-only-1 consumer reading the next message from a topic, processing it completely, and then starting back over again and polling for the next message?
Also, when I call:
val ssc = new StreamingContext(sc, Seconds(10))
Does this mean:
That a single consumer thread will receive all messages that were published to the topic in the last 10 seconds; or
That a single consumer thread will receive the next (single) message from the topic, and that it will poll for the next message every 10 seconds?
But what if I don't want multiple, parallel consumers? What if I want
1-and-only-1 consumer reading the next message from a topic,
processing it completely, and then starting back over again and
polling for the next message?
If that is your use-case, I'd say why use Spark at all? Its entire advantage is that you can read in parallel. The only hacky workaround I can think of is creating a Kafka topic with a single partition, which would make Spark assign the entire offset range to a single worker, but that is ugly.
Does that mean that a single consumer thread will receive all messages that were
published to the topic in the last 10 seconds or that a single
consumer thread will receive the next (single) message from the topic,
and that it will poll for the next message every 10 seconds?
Neither. Since you're using the direct (receiver-less) stream approach, it means that every 10 seconds your driver will ask Kafka to give it the offset ranges that have changed since the last batch, for each partition of the said topic. Then, Spark will take each such offset range and send it to one of the workers to consume directly from Kafka. This means that with the direct stream approach there is a 1:1 correspondence between Kafka partitions and Spark partitions.
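If you want to see those per-batch, per-partition offset ranges for yourself, the direct stream exposes them via HasOffsetRanges; here is a small sketch against the code above:
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD { rdd =>
  // Runs on the driver: one OffsetRange per Kafka partition for this batch.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}"))
}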