I have a Spark Streaming job that consumes data from Kafka and sends it back to Kafka after processing each record.
For this I apply some map operations to the data:
val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicNameMap, StorageLevel.MEMORY_AND_DISK)
var ad = ""
val abc = lines.map(_._2).map { x =>
val jsonObj = new JSONObject(x)
val data = someMethod(schema, jsonObj)
data
}
Then I run a foreach operation on it. I am not collecting the data to the driver here, because I want to send the records from inside the executors themselves.
abc.foreachRDD(rdd => {
  rdd.foreach { toSend =>
    val producer = KafkaProducerUtils.getKafkaProducer(kafkaBrokers)
    println("toSend---->" + toSend)
    producer.send(new ProducerRecord[String, String](topicToSend, toSend))
  }
})
I tried this code with 1405 records in a 10-second batch, but the job took approximately 2.5 minutes to complete. I know that creating a KafkaProducer is costly. Is there any other way to reduce the processing time? For testing I am using 2 executors with 2 cores and 1 GB each.
After a lot of searching I found this article about KafkaSink. It shows how to produce data to Kafka from inside Spark Streaming efficiently.
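A minimal sketch of the KafkaSink idea, assuming the kafka-clients producer API and String keys and values. Only the KafkaSink name comes from the article; the body below is an illustrative reconstruction, and producerConfig is a hypothetical helper. The producer is built lazily, so the wrapper can be broadcast once and each executor JVM creates a single producer on first use instead of one per record.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Serializable wrapper: only the factory function travels to the executors.
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  // Built on first use, so each executor JVM ends up with one producer.
  @transient private lazy val producer: KafkaProducer[String, String] = createProducer()

  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(config: Properties): KafkaSink =
    new KafkaSink(() => {
      val producer = new KafkaProducer[String, String](config)
      // Close (and thereby flush) the producer when the executor JVM shuts down.
      sys.addShutdownHook(producer.close())
      producer
    })

  // Hypothetical helper: the minimum settings a String producer needs.
  def producerConfig(brokers: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", brokers)
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }
}
```

On the driver this would be broadcast once, for example val kafkaSink = ssc.sparkContext.broadcast(KafkaSink(KafkaSink.producerConfig(kafkaBrokers))), and each record would then be sent with kafkaSink.value.send(topicToSend, toSend) from inside rdd.foreachPartition.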
There could be several reasons for such a long delay when processing this number of messages:
The problem could lie in the consume phase. With createStream, at least older versions of Spark use the high-level consumer implementation, which needs Zookeeper to store the offsets of the consumers belonging to a specific group. I would check this communication first, because the commit phase could be taking too long. If for any reason offsets are committed one message at a time, your consume rate will suffer.
Another possible reason is the write-ahead log on the file system. Although your configuration specifies memory and disk, the Spark documentation notes:
Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka
For better consumption rates I would use createDirectStream instead.
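A minimal sketch of that switch, assuming the same Kafka 0.8 integration (spark-streaming-kafka) that the question's createStream call comes from. The directLines name and its parameters are hypothetical; the direct stream talks to the brokers listed in metadata.broker.list rather than to Zookeeper.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectConsume {
  // Receiver-less stream: one RDD partition per Kafka partition, no write-ahead log.
  def directLines(ssc: StreamingContext,
                  brokers: String,
                  topics: Set[String]): InputDStream[(String, String)] = {
    val kafkaParams = Map("metadata.broker.list" -> brokers)
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  }
}
```

The result is still a stream of (key, value) pairs, so the existing lines.map(_._2) pipeline would not need to change.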
So I have a problem with Kafka Sinks in Spark Streaming while sending JSONs to multiple topics and unreliable kafka brokers. Here are some parts of code:
val kS = KafkaUtils.createDirectStream[String, TMapRecord](
  ssc,
  PreferConsistent,
  Subscribe[String, TMapRecord](topicsSetT, kafkaParamsInT))
Then I iterate over the RDDs:
kSMapped.foreachRDD {
rdd: RDD[TMsg] => {
rdd.foreachPartition {
part => {
part.foreach { ...........
And inside foreach I do
kafkaSink.value.send(kafkaTopic, strJSON)
kafkaSinkMirror.value.send(kafkaTopicMirrorBroker, strJSON)
When the mirror broker is down, the entire streaming application waits for it and nothing is sent to the main broker.
How would you handle it?
For the simplest solution you propose, imagine that we just skip messages that were meant for a broker that went down (say, that's CASE 1).
For CASE 2, we'd do some buffering.
P.S. Later on I will use Kafka Mirror, but currently I don't have such an option so I need to make some solution in my code.
I've found several solutions to this problem:
You can throw a timeout exception on the worker and rely on checkpoints. Spark retries a failed task several times, as set by the spark.task.maxFailures property, and you can increase the number of retries. If the streaming job fails after the maximum number of retries, restart it from the checkpoint once the broker is available. Alternatively, stop the job manually when it fails.
You could enable backpressure with spark.streaming.backpressure.enabled=true, which lets the job receive data only as fast as it can process it.
You could send your two results back to a technical Kafka topic and handle them later with another streaming job.
You could use Hive or HBase as a buffer for these cases and send the unhandled data later in batch mode.
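The first two options are plain configuration. A minimal sketch, assuming the settings are applied through SparkConf before the StreamingContext is created; the application name and the retry count of 8 are illustrative values, not recommendations.

```scala
import org.apache.spark.SparkConf

object StreamingConf {
  val conf: SparkConf = new SparkConf()
    .setAppName("kafka-mirror-sink") // hypothetical application name
    // Retry a failing task more often before the job is aborted (the default is 4).
    .set("spark.task.maxFailures", "8")
    // Let Spark Streaming throttle ingestion to the rate it can actually process.
    .set("spark.streaming.backpressure.enabled", "true")
}
```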
In my code, I first subscribe to a Kafka stream, process each RDD to create instances of my class People, and then I want to publish the result set (Dataset[People]) to a specific Kafka topic. It is important to note that not every incoming message received from Kafka maps to an instance of People. Moreover, instances of People should be sent to Kafka in exactly the same order as received from Kafka.
However, I am not sure if sorting is really necessary or if the instances of People maintain the same order when the respective code is run on the executors (and I can directly publish my Dataset to Kafka). As far as I understand, sorting is necessary, because the code inside foreachRDD can be executed on different nodes in the cluster. Is this correct?
Here's my code:
val myStream = KafkaUtils.createDirectStream[K, V](streamingContext, PreferConsistent, Subscribe[K, V](topics, consumerConfig))
def process(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = record match {
case (rdd, time) if !rdd.isEmpty =>
// More Code...
// In the end, I have: Dataset[People]
case _ =>
}
myStream.foreachRDD((x, y) => process((x, y))) // Do I have to replace this call with map, sort the RDD and then publish it to Kafka?
Moreover, instances of people should be sent to Kafka in exactly the same order as received from Kafka.
Unless you have a single partition (and then you wouldn't use Spark, would you?), the order in which data is received is not deterministic, and neither is the order in which data is sent. Sorting doesn't make any difference here.
If you need a very specific order of processing (typically a design mistake if you work with data-intensive applications), you need a sequential application, or a system with much more granular control than Spark.
The Kafka direct consumer started to limit reads to 450 events (5 * 90 partitions) per batch (5 seconds). It was running fine for 1 or 2 days before that (about 5000 to 40000 events per batch).
I'm using a Spark standalone cluster (spark and spark-streaming-kafka version 1.6.1) running in AWS, with an S3 bucket as the checkpoint directory: StreamingContext.getOrCreate(config.sparkConfig.checkpointDir, createStreamingContext). There are no scheduling delays and there is enough disk space on each worker node.
I didn't change any Kafka client initialization parameters, and I'm pretty sure that Kafka's structure hasn't changed:
val kafkaParams = Map("metadata.broker.list" -> kafkaConfig.broker)
val topics = Set(kafkaConfig.topic)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
Also, I can't understand why I still need a checkpoint directory when creating the streaming context, given that the direct consumer description says "The consumed offsets are tracked by the stream itself".
This is usually the result of enabling backpressure by setting spark.streaming.backpressure.enabled to true. Usually, when the backpressure algorithm sees that there's more data coming in than it's used to, it starts capping each batch to a rather small size until it can "stabilize" itself again. This sometimes produces false positives and slows down your stream's processing rate.
If you want to tweak the heuristic a little, there are some undocumented flags it is using (just make sure you know what you're doing):
val proportional = conf.getDouble("spark.streaming.backpressure.pid.proportional", 1.0)
val integral = conf.getDouble("spark.streaming.backpressure.pid.integral", 0.2)
val derived = conf.getDouble("spark.streaming.backpressure.pid.derived", 0.0)
val minRate = conf.getDouble("spark.streaming.backpressure.pid.minRate", 100)
If you want the gory details, PIDRateEstimator is what you're looking for.
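As a rough illustration of why the cap can land on exactly 450, assume the estimator has fallen back to the minRate floor of 100 records per second and that this total is divided evenly among the 90 partitions with integer division. Both assumptions are a simplification for the sake of the arithmetic, not a statement of how every Spark version computes the limit, and the object and method names are hypothetical.

```scala
object BackpressureFloor {
  // Records one partition may read per batch when a total rate (records/second)
  // is split evenly across partitions with integer division.
  def perPartitionPerBatch(totalRatePerSec: Long, partitions: Int, batchSeconds: Int): Long =
    (totalRatePerSec / partitions) * batchSeconds

  // Records the whole batch may contain under the same assumption.
  def perBatch(totalRatePerSec: Long, partitions: Int, batchSeconds: Int): Long =
    perPartitionPerBatch(totalRatePerSec, partitions, batchSeconds) * partitions

  def main(args: Array[String]): Unit =
    // 100 / 90 = 1 record per second per partition, times 5 seconds, times 90 partitions.
    println(perBatch(totalRatePerSec = 100L, partitions = 90, batchSeconds = 5)) // prints 450
}
```

If that reading applies, raising spark.streaming.backpressure.pid.minRate or disabling backpressure would be the settings to experiment with.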
I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka to be the broker in between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
    kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, props, topicsSet)
}
def consumerHandler(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(10))
createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
rdd.collect().foreach { msg =>
// Now do some work as soon as we receive a message from the topic
}
})
ssc
}
StreamingContext.getActive.foreach {
_.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) is received, it is sent to any available worker nodes to actually be executed
So first, if anything that I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. Its job is to create the graph of execution for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which abstraction you're using for Kafka: either the receiver approach via KafkaUtils.createStream, or the receiver-less approach via KafkaUtils.createDirectStream.
In both approaches in general, data is being consumed from Kafka and then dispatched to each Spark worker to process in parallel.
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism either by increasing the number of partitions in Kafka, or by repartitioning the data inside Spark (using DStream.repartition). I suggest testing this setup to determine whether it suits your performance requirements.
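A minimal sketch of the repartitioning option, assuming a stream of (key, value) String pairs such as the one createKafkaStream returns. The handle function is a hypothetical placeholder for the per-message work of Job #2. Unlike rdd.collect(), which pulls every message back to the driver, foreachPartition keeps the work on the executors.

```scala
import org.apache.spark.streaming.dstream.DStream

object ParallelConsume {
  // Hypothetical placeholder for the per-message work of Job #2.
  def handle(key: String, value: String): Unit = ()

  def process(stream: DStream[(String, String)], parallelism: Int): Unit =
    stream
      .repartition(parallelism) // shuffle into more partitions than Kafka provides
      .foreachRDD { rdd =>
        // Runs on the executors; nothing is collected to the driver.
        rdd.foreachPartition { messages =>
          messages.foreach { case (key, value) => handle(key, value) }
        }
      }
}
```

Repartitioning costs a shuffle per batch, so it is worth measuring against simply adding Kafka partitions.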
I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints and the like.
Since Spark 1.3 there is a direct approach with several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams
and union-ing them. With directStream, Spark Streaming will create as
many RDD partitions as there is Kafka partitions to consume, which
will all read data from Kafka in parallel. So there is one-to-one
mapping between Kafka and RDD partitions, which is easier to
understand and tune.
Efficiency: Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated
the data. This is actually inefficient as the data effectively gets
replicated twice - once by Kafka, and a second time by the Write Ahead
Log. This second approach eliminates the problem as there is no
receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API
to store consumed offsets in Zookeeper. This is traditionally the way
to consume data from Kafka. While this approach (in combination with
write ahead logs) can ensure zero data loss (i.e. at-least once
semantics), there is a small chance some records may get consumed
twice under some failures. This occurs because of inconsistencies
between data reliably received by Spark Streaming and offsets tracked
by Zookeeper. Hence, in this second approach, we use simple Kafka API
that does not use Zookeeper and offsets tracked only by Spark
Streaming within its checkpoints. This eliminates inconsistencies
between Spark Streaming and Zookeeper/Kafka, and so each record is
received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more under Approach 2 here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
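If you want to keep the consumed offsets somewhere other than a checkpoint directory, they can be read from each batch. A minimal sketch, assuming the Kafka 0.8 direct stream from spark-streaming-kafka; the save function is a hypothetical stand-in for a write to your own store.

```scala
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object OffsetLogging {
  // Hypothetical sink: replace the println with a write to your own store.
  def save(range: OffsetRange): Unit =
    println(s"${range.topic} ${range.partition} ${range.fromOffset} ${range.untilOffset}")

  def trackOffsets(stream: InputDStream[(String, String)]): Unit =
    stream.foreachRDD { rdd =>
      // The cast only works on the RDD produced directly by the direct stream,
      // before any transformation is applied to it.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(save) // runs on the driver, once per batch
    }
}
```

This only covers Kafka offsets; state kept by updateStateByKey still relies on checkpointing.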