Spark streaming: Cache DStream results across batches - scala

Using Spark streaming (1.6) I have a filestream for reading lookup data with 2s of batch size, however files are copyied to the directory only every hour.
Once there's a new file, its content is read by the stream, this is what I want to cache into memory and keep there
until new files are read.
There's another stream to which I want to join this dataset therefore I'd like to cache.
This is a follow-up question of Batch lookup data for Spark streaming.
The answer does work fine with updateStateByKey however I don't know how to deal with cases when a KV pair is
deleted from the lookup files, as the Sequence of values in updateStateByKey keeps growing.
Also any hint how to do this with mapWithState would be great.
This is what I tried so far, but the data doesn't seem to be persisted:
val dictionaryStream = ssc.textFileStream("/my/dir")
dictionaryStream.foreachRDD{x =>
if (!x.partitions.isEmpty) {
x.unpersist(true)
x.persist()
}
}

DStreams can be persisted directly using persist method which persist every RDD in the stream:
dictionaryStream.persist
According to the official documentation this applied automatically for
window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey
so there should be no need for explicit caching in your case. Also there is no need for manual unpersisting. To quote the docs once again:
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared
and a retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState you'll have to provide a StateSpec. A minimal example requires a functions which takes key, Option of current value and previous state. Lets say you have DStream[(String, Long)] and you want to record maximum value so far:
val state = StateSpec.function(
(key: String, current: Option[Double], state: State[Double]) => {
val max = Math.max(
current.getOrElse(Double.MinValue),
state.getOption.getOrElse(Double.MinValue)
)
state.update(max)
(key, max)
}
)
val inputStream: DStream[(String, Double)] = ???
inputStream.mapWithState(state).print()
It is also possible to provide initial state, timeout interval and capture current batch time. The last two can be used to implement removal strategy for the keys which haven't been update for some period of time.

Related

In Spark, how objects and variables are kept in memory and across different executors?

In Spark, how objects and variables are kept in memory and across different executors?
I am using:
Spark 3.0.0
Scala 2.12
I am working on writing a Spark Structured Streaming job with a custom Stream Source. Before the execution of the spark query, I create a bunch of metadata which is used by my Spark Streaming Job
I am trying to understand how this metadata is kept in memory across different executors?
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs:List[JobConfig] = build(); //build the job configs
val query = spark
.readStream
.format("custom-streaming")
.load
query
.writeStream
.trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
CustomJobExecutor.start(jobConfigs) //CustomJobExecutor does data frame transformations and save the data in PostgreSQL.
}
}.outputMode(OutputMode.Append()).start().awaitTermination()
Need help in understanding following:
In the sample code, how Spark will keep "jobConfigs" in memory across different executors?
Is there any added advantage of broadcasting?
What is the efficient way of keeping the variables which can't be deserialized?
Local variables are copied for each task meanwhile broadcasted variables are copied only per executor. From docs
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
It means that if your jobConfigs is large enough and the number of tasks and stages where the variable is used significantly larger than the number of executors, or deserialization is time-consuming, in that case, broadcast variables can make a difference. In other cases, they don't.

Unbounded table is spark structured streaming

I'm starting to learn Spark and am having a difficult time understanding the rationality behind Structured Streaming in Spark. Structured streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as new row in the table. I have the following piece of code to read in incoming files to the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1GB file to the folder. As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file. Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In Structured Streaming Model, this is how the execution of this query is performed.
Question : If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file. Or does it automatically
batch it? If yes, is this batching parameter configurable?
Answer : There is no point of OOM since it is RDD(DF/DS)lazily initialized. of
course you need to re-partition before processing to ensure equal
number of partitions and data spread across executors uniformly...

Spark Streaming states to be persisted to disk in addition to in memory

I have written a program using spark streaming by using map with state function which detect repetitive records and avoid such records..the function is similar as bellow:
val trackStateFunc1 = (batchTime: Time,
key: String,
value: Option[(String,String)],
state: State[Long]) => {
if (state.isTimingOut()) {
None
}
else if (state.exists()) None
else {
state.update(1L)
Some(value.get)
}
}
val stateSpec1 = StateSpec.function(trackStateFunc1)
//.initialState(initialRDD)
.numPartitions(100)
.timeout(Minutes(30*24*60))
My numbers of records could be high and I kept the time-out for about one month. Therefore, number of records and keys could be high..I wanted to know if I can save these states on Disk in addition to the Memory..something like
"RDD.persist(StorageLevel.MEMORY_AND_DISK_SER)"
I wanted to know if I can save these states on Disk in addition to the
Memory
Stateful streaming in Spark automatically get serialized to persistent storage, this is called checkpointing. When you run your stateful DStream, you must provide a checkpoint directory otherwise the graph won't be able to execute at runtime.
You can set the checkpointing interval via DStream.checkpoint. For example, if you want to set it to every 30 seconds:
inputDStream
.mapWithState(trackStateFunc)
.checkpoint(Seconds(30))
Accourding to "MapWithState" sources you can try:
mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)
actual for spark 3.0.1

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use redis and its incrby (increment by) command as a state engine and to reduce the number of requests sent to redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp to obtain my functionality from word count).
I would like to avoid shuffling and let redis take care of adding data across worker nodes.
Even when I have checked that data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word count reduce on each partitioner without triggering the shuffling step in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then continues to supply no syntax for how to do this, nor have I had any luck with google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.hashMap, which then gets converted to an Iterator to form the output RDD.
val myDstream = messages
.mapPartitions( it =>
it.map(_._2)
.foldLeft(new mutable.HashMap[String, Int])(
(count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
).toIterator
)

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count now your sequence of operations will need to be executed because the result of count is an integer. What each worker core (executor thread) will do in parallel then, is read a file (or piece of file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work. The code will try to send the rdd variable to all workers and will fail with a NotSerializableException. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works, because count is an Int and can be serialized.
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). It's not an intention to split every file between all available nodes.
However, for you (i.e. the client) working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf