Why is foreachRDD in Spark Streaming + Kafka slow, and should it be?

I am using Spark 2.1 and Kafka 0.8.x to do a Spark Streaming job. It is a text filtering job, and most of the text will be filtered out during the process. I implemented it in two different ways:
Method 1: do the filtering directly on the output of the DirectStream:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val jsonMsg = messages.map(_._2)
val filteredMsg = jsonMsg.filter(x=>x.contains(TEXT1) && x.contains(TEXT2) && x.contains(TEXT3))
Method 2: use the foreachRDD function:
messages.foreachRDD { rdd =>
  val record = rdd.map(_._2).filter(x => x.contains(TEXT1) &&
                                         x.contains(TEXT2) &&
                                         x.contains(TEXT3))
}
I found the first method is noticeably faster than the second method, but I am not sure whether this is generally the case.
Is there any difference between method 1 and method 2?

filter is a transformation. Transformations are evaluated lazily; that is, they don't do anything until you perform an action, such as foreachRDD, writing the data out, etc.
So in method 1 nothing is actually executed, which is why it appears significantly faster than method 2, where the foreachRDD action forces the work to be done.
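To make the comparison fair, method 1 needs an action of its own; a minimal sketch (the count/print is only an illustrative action, not part of the original code):
// Hypothetical action appended to method 1 so the lazy map/filter chain actually runs.
filteredMsg.foreachRDD { rdd =>
  // count() forces evaluation of the filtered DStream for this batch
  println(s"matching records in this batch: ${rdd.count()}")
}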

Related

How to collect all records at spark executor and process it as batch

In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is to accumulate all the JSON data in a list/rule session and send it to the rule engine for processing as a batch on the executor side.
// Scala code example:
val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch(function)
  .start()

query.awaitTermination()

val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session, created on the driver side
  batchDF.foreach(row => { // this piece of code runs on the executor
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // NullPointerException here, as the ruleSession is not available on the executor
  })
  ruleHandler.processRule(ruleSession) // again, this is in the driver scope
}
In the above code, the problem I am facing is: the function used in foreachBatch is executed on the driver side, while the code inside batchDF.foreach is executed on the worker/executor side, and so it fails to get the ruleSession.
Is there any way to run the whole function on each executor?
OR
Is there a better way to accumulate all the data of a batch DataFrame after transformation and send it to the next process from within the executor/worker?
I think this might work ... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In that block, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send the whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).
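A minimal sketch of that foreachPartition pattern inside the existing foreachBatch function (hypothetical: it assumes kBase, jsonHandler and ruleHandler can be created or reached on the executors, which is exactly what the driver-side session in the question cannot do):
// Sketch only: one rule session per partition, created and disposed on the executor.
batchDF.foreachPartition { rows: Iterator[Row] =>
  val ruleSession = kBase.newKieSession()  // "open the connection" on the executor
  rows.foreach { row =>
    val jsonData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData)           // accumulate this partition's records
  }
  ruleHandler.processRule(ruleSession)     // fire the rules once per partition chunk
  ruleSession.dispose()                    // "close the connection"
}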
@codeaperature, this is how I achieved batching, inspired by your answer; posting it as an answer as it exceeds the word limit of a comment.
Using foreach on the DataFrame and passing in a ForeachWriter.
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON record to the rule session in the process method.
Executing the rules in the close method, with the rule session loaded with the batch of data.
// Scala code:
val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null

  def open(partitionId: Long, version: Long): Boolean = { // open is called once per partition for every trigger (batch)
    ruleSession = kBase.newKieSession()
    true
  }

  def process(row: Row): Unit = { // process is called for each record of the batch
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add every input JSON record to the rule session
  }

  def close(errorOrNull: Throwable): Unit = { // close is called after process has run for all records of the batch
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}

val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(dataConsumer)
  .start()

How to compare two RDDs in Spark, when data of second stream may not yet be available?

I am working on a Spark app that streams data from two different topics topic_a and topic_b of a Kafka server. I want to consume both streams and check if the data coming from both topics is equal.
val streamingContext = new StreamingContext(sparkContext, Seconds(batchDuration))
val eventStream = KafkaUtils.createDirectStream[String, String](streamingContext, PreferConsistent, Subscribe[String, String](topics, consumerConfig))
def start(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = {
  // ...
  def cmp(rddA: RDD[ConsumerRecord[String, String]], rddB: RDD[ConsumerRecord[String, String]]): Unit = {
    // Do compare...
    // but rddA or rddB may be empty! :-(
  }

  val rddTopicA = rdd.filter(_.topic == "topic_a")
  val rddTopicB = rdd.filter(_.topic == "topic_b")
  cmp(rddTopicA, rddTopicB)
}
eventStream.foreachRDD((x, y) => start((x, y)))
streamingContext.start()
streamingContext.awaitTermination()
The problem is that, when comparing both RDDs in cmp, one of the RDDs may be empty, as the data stream may not yet be available in Kafka. Is it possible to somehow wait until both RDDs have the same amount of rows and then start the comparison? Or first convert the RDD that has data into a DataSet and then temporarily store it for later comparison?
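For reference, a minimal sketch of what cmp could do within a single batch, assuming "equal" means both sides contain the same record values (this does not by itself solve the problem of one topic lagging behind the other):
def cmp(rddA: RDD[ConsumerRecord[String, String]], rddB: RDD[ConsumerRecord[String, String]]): Unit = {
  val valuesA = rddA.map(_.value())
  val valuesB = rddB.map(_.value())
  // Equal only if neither side has values the other lacks.
  val equal = valuesA.subtract(valuesB).isEmpty() && valuesB.subtract(valuesA).isEmpty()
  if (!equal) {
    // One side may simply not have received its data yet in this batch.
    println(s"Batches differ: A has ${valuesA.count()} records, B has ${valuesB.count()}")
  }
}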

Spark Streaming - How to get results out of foreachRDD function? [duplicate]

I am trying to access a collection of filtered DStreams obtained like in the solution to this question: Spark Streaming - Best way to Split Input Stream based on filter Param
I create the Collection as follows:
val statuCodes = Set("200","500", "404")
spanTagStream.cache()
val statusCodeStreams = statuCodes.map(key => key -> spanTagStream.filter(x => x._3.get("http.status_code").getOrElse("").asInstanceOf[String].equals(key)))
I try to access statusCodeStreams in the following way:
for (streamTuple <- statusCodeStreams) {
  streamTuple._2.foreachRDD(rdd =>
    rdd.foreachPartition(
      partitionOfRecords => {
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServers)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partitionOfRecords.foreach { x =>
          /* Code writing to Kafka using streamTuple._1 as the topic string */
        }
      })
  )
}
When executing this I receive the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects
How do I access the Streams to write to Kafka in a serializable way?
As the exception indicates, the DStream definition is being captured by the closure.
A simple option is to declare this DStream transient:
@transient val spanTagStream = // KafkaUtils.create...
@transient flags certain objects to be excluded from the Java serialization of an object graph. The key to this scenario is that a val declared in the same scope as the DStream (statusCodeStreams in this case) is used within the closure. The actual reference to that val from within the closure is outer.statusCodeStreams, which causes the serialization process to "pull" the whole context of outer into the closure. With @transient we mark the DStream (and also the StreamingContext) declarations as non-serializable and avoid the serialization issue. Depending on the code structure (e.g. if it is all linear in one main function, which is bad practice, by the way), it might be necessary to mark ALL DStream declarations plus the StreamingContext instance as @transient.
If the only intent of the initial filtering is to 'route' the content to separate Kafka topics, it might be worth moving the filtering within the foreachRDD. That would make for a simpler program structure.
spanTagStream.foreachRDD { rdd =>
  rdd.cache()
  statuCodes.foreach { code =>
    val matchingCodes = rdd.filter(x => x._3.get("http.status_code").getOrElse("").asInstanceOf[String].equals(code))
    matchingCodes.foreachPartition { partitionOfRecords =>
      // create a KafkaProducer here and write the partition's records to the topic named after `code`
    }
  }
  rdd.unpersist(true)
}

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new hive.HiveContext(sc)
    val query = "<Some Hive Query>"
    val rawData = sqlContext.sql(query).cache()

    val aggregatedData = rawData.groupBy("group_key")
      .agg(
        max("col1").as("max"),
        min("col2").as("min")
      )

    val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))

    aggregatedData.foreachPartition { rows =>
      writePartitionToRedis(rows, redisConfig)
    }

    aggregatedData.write.parquet(s"/data/output.parquet")
  }
}
Against my intuition, the Spark scheduler yields two jobs, one for each data sink (Redis, HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the behaviour to be expected?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; it is only a set of instructions that will be executed when you call an action (such as writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that have to be re-evaluated every time you call an action.
If you want to reuse data without re-evaluating an RDD, use .cache() or, preferably, persist. Persisting an RDD lets you store the result of a transformation so that it does not have to be re-evaluated by future actions. In the code above it is aggregatedData that is used by both sinks, so that is the DataFrame worth persisting; caching only rawData still leaves the groupBy aggregation to be recomputed for each action.
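A minimal sketch of that change, assuming both writes stay in the same job sequence (MEMORY_AND_DISK is just one reasonable storage level):
import org.apache.spark.storage.StorageLevel

// Persist the DataFrame that both sinks reuse, so the Hive query and the
// aggregation are computed once and the second action reads the cached result.
aggregatedData.persist(StorageLevel.MEMORY_AND_DISK)

aggregatedData.foreachPartition { rows => writePartitionToRedis(rows, redisConfig) }
aggregatedData.write.parquet("/data/output.parquet")

// Free the cached blocks once both sinks have been written.
aggregatedData.unpersist()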