What is the difference between saveAsObjectFile and persist in apache spark? - scala

I am trying to compare Java and Kryo serialization. When I save the RDD to disk with saveAsObjectFile, both serializers produce the same file size, but when I persist the RDD, the Spark UI shows different sizes. The Kryo one is smaller than the Java one, yet the irony is that the processing time for Java is lower than that for Kryo, which is not what I expected from the Spark UI.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)

val conf = new SparkConf()
  .setAppName("kyroExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(
    Array(classOf[Person], classOf[Array[Person]])
  )
val sparkContext = new SparkContext(conf)

val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray
val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)

evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count()

persist and saveAsObjectFile solve different needs.
persist has a misleading name. It is not meant to permanently persist the RDD result. persist is used to temporarily persist a computation result of the RDD during a Spark workflow. The user has no control over the location of the persisted data. persist is just caching with a configurable caching strategy: memory, disk or both. In fact, cache simply calls persist with the default caching strategy.
e.g.
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines, repeating the above operation
errors.filter(col("line").like("%MySQL%")).count()
vs
val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
// Runs only on the errors tmp result containing only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
saveAsObjectFile is for permanent persistence. It is used to serialize the final result of a Spark job onto a persistent, usually distributed, filesystem like HDFS or Amazon S3.
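For example, a later job can read that file back with sparkContext.objectFile. A minimal sketch, reusing the path and the Person case class from the question above:
// Hedged sketch: reload the previously saved object file in another job.
val reloaded: RDD[Person] = sparkContext.objectFile[Person]("src/main/resources/objectFile")
println(reloaded.count())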

Spark's persist and saveAsObjectFile are not the same at all.
persist - persists your RDD DAG to the requested StorageLevel; from then on, any transformation applied to this RDD will be calculated only from the persisted DAG.
saveAsObjectFile - just saves the RDD into a SequenceFile of serialized objects.
saveAsObjectFile doesn't use the "spark.serializer" configuration at all,
as you can see in the code below:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
saveAsObjectFile uses Utils.serialize to serialize your objects, and the serialize method's definition is:
/** Serialize an object using Java serialization */
def serialize[T](o: T): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}
saveAsObjectFile always uses Java serialization.
persist, on the other hand, uses whatever spark.serializer you have configured.
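If you want to compare the serialized sizes directly rather than through the UI, here is a minimal sketch, assuming the conf and the Person case class from the question; JavaSerializer and KryoSerializer are Spark's own serializer classes:
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer}
// Hedged sketch: serialize one Person with each serializer and compare byte counts.
val person = Person("42", 42)
val javaBytes = new JavaSerializer(conf).newInstance().serialize(person).remaining()
val kryoBytes = new KryoSerializer(conf).newInstance().serialize(person).remaining()
println(s"java: $javaBytes bytes, kryo: $kryoBytes bytes")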

Related

How to collect all records at spark executor and process it as batch

In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is to accumulate all the JSON data in a list/rule session and send it to the rule engine for processing as a batch on the executor side.
//Scala Code Example:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreachBatch(function)
.start()
query.awaitTermination()
val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session, created at the driver side
  batchDF.foreach(row => { // this piece of code runs in the executor
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // getting a NullPointerException here, as the ruleSession is not available in the executor
  })
  ruleHandler.processRule(ruleSession) // again, this is in the driver scope
}
In the above code, the problem I am facing is: the function used in foreachBatch is executed on the driver side, while the code inside batchDF.foreach is executed on the worker/executor side, and thus fails to get the ruleSession.
Is there any way to run the whole function at each executor side?
OR
Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to next process from within the executor/worker?
I think this might work ... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In that portion, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send that whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).
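A minimal sketch of that foreachPartition variant, assuming kBase, jsonHandler and ruleHandler from the question are available (serializable or constructible) on the executors:
// Hedged sketch: one Drools session per partition, created and closed on the executor.
batchDF.foreachPartition { (rows: Iterator[Row]) =>
  val ruleSession = kBase.newKieSession()
  rows.foreach { row =>
    val jsonData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData)
  }
  ruleHandler.processRule(ruleSession) // fire the rules for this partition's batch
  ruleSession.dispose()                // release the session ("close the connection")
}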
@codeaperature, this is how I achieved batching, inspired by your answer. Posting it as an answer because it exceeds the word limit of a comment.
Using foreach on the dataframe and passing in a ForeachWriter.
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON to the rule session in the process method.
Executing the rules in the close method, with the rule session loaded with the batch of data.
//Scala code:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreach(dataConsumer)
.start()
val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null

  def open(partitionId: Long, version: Long): Boolean = { // open is called once for every batch
    ruleSession = kBase.newKieSession()
    true
  }

  def process(row: Row) = { // process is called for each record in the batch
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add each input JSON to the rule session
  }

  def close(errorOrNull: Throwable): Unit = { // close is called after process has run for all records in the batch
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of rules
    }
  }
}

SparkException: This RDD lacks a SparkContext

I'm trying to create a string sampler using an RDD of strings as a dictionary and the RandomDataGenerator class from the org.apache.spark.mllib.random package.
import org.apache.spark.mllib.random.RandomDataGenerator
import org.apache.spark.rdd.RDD
import scala.util.Random
class StringSampler(var dic: RDD[String], var seed: Long = System.nanoTime) extends RandomDataGenerator[String] {
  require(dic != null, "Dictionary cannot be null")
  require(!dic.isEmpty, "Dictionary must contain lines (words)")
  Random.setSeed(seed)
  var fraction: Double = 1.0 / dic.count()

  // return a random line from the dictionary
  override def nextValue(): String = dic.sample(withReplacement = true, fraction).take(1)(0)

  override def setSeed(newSeed: Long): Unit = Random.setSeed(newSeed)

  override def copy(): StringSampler = new StringSampler(dic)

  def setDictionary(newDic: RDD[String]): Unit = {
    require(newDic != null, "Dictionary cannot be null")
    require(!newDic.isEmpty, "Dictionary must contain lines (words)")
    dic = newDic
    fraction = 1.0 / dic.count()
  }
}
val dictionaryName: String
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName) // dictionary.cache()
val sampler = new StringSampler(dictionary)
RandomRDDs.randomRDD(context, sampler, size, numPartitions)
But I encounter a SparkException saying that the dictionary RDD is lacking a SparkContext when I try to generate a random RDD of strings. It seems that Spark loses the context of the dictionary RDD when copying it to the cluster nodes, and I don't know how to fix it.
I tried to cache the dictionary before passing it to the StringSampler, but it didn't change anything...
I was thinking about linking it back to the original SparkContext, but I don't even know if that's possible. Does anyone have an idea?
Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
I believe the issue is here:
val dictionaries: Broadcast[Map[String, RDD[String]]]
val dictionary: RDD[String] = dictionaries.value(dictionaryName)
You should not be broadcasting anything containing an RDD. RDDs are already parallelized and spread throughout the cluster. The error comes from trying to serialize and deserialize an RDD, which loses its context and is pointless anyway.
Just do this:
val dictionaries: Map[String, RDD[String]]
val dictionary: RDD[String] = dictionaries(dictionaryName)
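If the goal was to make the dictionary contents available locally on the executors, a hedged alternative (assuming each dictionary is small enough to collect, and with sc standing in for the surrounding SparkContext) is to broadcast the plain data rather than the RDD handle:
// Hypothetical sketch: collect each dictionary RDD and broadcast the raw strings.
val localDictionaries: Map[String, Array[String]] =
  dictionaries.map { case (name, rdd) => name -> rdd.collect() }
val dictionariesBc: Broadcast[Map[String, Array[String]]] = sc.broadcast(localDictionaries)
val words: Array[String] = dictionariesBc.value(dictionaryName)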

Spark Streaming - How to get results out of foreachRDD function? [duplicate]

I am trying to access a collection of filtered DStreams obtained like in the solution to this question: Spark Streaming - Best way to Split Input Stream based on filter Param
I create the Collection as follows:
val statuCodes = Set("200","500", "404")
spanTagStream.cache()
val statusCodeStreams = statuCodes.map(key => key -> spanTagStream.filter(x => x._3.get("http.status_code").getOrElse("").asInstanceOf[String].equals(key)))
I try to access statusCodeStreams in the following way:
for (streamTuple <- statusCodeStreams) {
  streamTuple._2.foreachRDD(rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServers)
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partitionOfRecords.foreach { x =>
        /* Code writing to Kafka using streamTuple._1 as the topic string */
      }
    }
  )
}
When executing this I receive the following error:
java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka010.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects
How do I access the Streams to write to Kafka in a serializable way?
As the exception indicates, the DStream definition is being captured by the closure.
A simple option is to declare this DStream transient:
@transient val spanTagStream = //KafkaUtils.create...
@transient flags certain objects to be removed from the Java serialization of an object graph. The key to this scenario is that some val declared in the same scope as the DStream (statusCodeStreams in this case) is used within the closure. The actual reference to that val from within the closure is outer.statusCodeStreams, which causes the serialization process to "pull" the whole context of outer into the closure. With @transient we mark the DStream (and also the StreamingContext) declarations as non-serializable, and we avoid the serialization issue. Depending on the code structure (if it's all linear in one main function, which is bad practice, by the way), it might be necessary to mark ALL DStream declarations plus the StreamingContext instance as @transient.
If the only intent of the initial filtering is to 'route' the content to separate Kafka topics, it might be worth moving the filtering inside the foreachRDD. That would make for a simpler program structure:
spanTagStream.foreachRDD { rdd =>
  rdd.cache()
  statuCodes.map { code =>
    val matchingCodes = rdd.filter(...)
    matchingCodes.foreachPartition { /* write to Kafka */ }
  }
  rdd.unpersist(true)
}
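A hedged, fleshed-out version of that sketch, reusing the filter predicate and Kafka producer setup from the question (kafkaServers and the KafkaProducer/ProducerRecord imports are assumed to be in scope; the record value x.toString is just a placeholder):
spanTagStream.foreachRDD { rdd =>
  rdd.cache()
  statuCodes.foreach { code =>
    val matchingCodes = rdd.filter(x =>
      x._3.get("http.status_code").getOrElse("").asInstanceOf[String].equals(code))
    matchingCodes.foreachPartition { partitionOfRecords =>
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServers)
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partitionOfRecords.foreach { x =>
        producer.send(new ProducerRecord[String, String](code, x.toString)) // one topic per status code
      }
      producer.close()
    }
  }
  rdd.unpersist(true)
}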

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new hive.HiveContext(sc)

    val query = "<Some Hive Query>"
    val rawData = sqlContext.sql(query).cache()

    val aggregatedData = rawData.groupBy("group_key")
      .agg(
        max("col1").as("max"),
        min("col2").as("min")
      )

    val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))

    aggregatedData.foreachPartition { rows =>
      writePartitionToRedis(rows, redisConfig)
    }

    aggregatedData.write.parquet(s"/data/output.parquet")
  }
}
Against my intuition, the Spark scheduler yields a separate job for each data sink (Redis, HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the expected behaviour?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; all it is is a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that will need to be evaluated every time you call an action.
If you want to reuse data without needing to re-evaluate the RDD, use .cache() or, preferably, persist. Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be re-evaluated in future iterations.
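Applied to the job above, a minimal sketch (assuming the same writePartitionToRedis helper and redisConfig) would persist aggregatedData before the two sinks, so the Hive query and the aggregation run only once:
import org.apache.spark.storage.StorageLevel

val aggregatedData = rawData.groupBy("group_key")
  .agg(max("col1").as("max"), min("col2").as("min"))
  .persist(StorageLevel.MEMORY_AND_DISK) // materialized by the first action, reused by the second

aggregatedData.foreachPartition { rows =>
  writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet("/data/output.parquet")

aggregatedData.unpersist()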