How to count number of rows per window in streaming queries? - scala

Scenario: Working on Spark Streaming in Structured SQL. I have to implement a "info" dataset about how many rows I've processed in the last "window".
A little bit of code.
val invalidData: Dataset[String] =
parsedData.filter(record => !record.isValid).map(record => record.rawInput)
val validData: Dataset[FlatOutput] = parsedData
.filter(record => record.isValid)
I have two Dataset. But since I'm working on Streaming I cannot perform a .count (Error raised: Queries with streaming sources must be executed with writeStream.start())
val infoDataset = validData
.select(count("*") as "valid")
but a new error occurs: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark and I don't want to set outputMode as complete since I don't want total count from beginning, but just last "windowed" batch.
Unfortunately I don't have any column that I could register as watermark for these datasets.
Is there a way to know how many rows are processed in each iteration?

It seems like StreamingQueryStatus could be of some help.

Related

Spark - parallel computation for different dataframes

A premise: this question might sound idiotic, but I guess I fell into confusion and/ignorance.
The question is: does Spark already optmize its physical plan to execute computations on unrelated dataframes to be in parallel? If not, would it be advisable to try and parallelize such processes? Example below.
Let's assume I have the following scenario:
val df1 = read table into dataframe
val df2 = read another table into dataframe
val aTransformationOnDf1 = df1.filter(condition).doSomething
val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations
// Push to Kafka
aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic
val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing
val unionAllTransformation = aTransformationOnDf1
.union(anotherTransformationOnDf1WithDf2)
.union(yetAnotherTransformationOnDf1WithDf2)
unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
Basically I have two initial dataframes. One is an avent log with past events and new events to process. As an example:
a subset of these new events must be processed and pushed to Kafka.
a subset of the past events could have updates, so they must be processed alone
another subset of the past events could have another kind of updates, so they must be processed alone
In the end, all processed events are unified in one dataframe to be written back to the events' log table.
Question: does Spark process the different subsets in parallel or sequentially (and onyl computation within each individual dataframe is performed distributedly)?
If not, could we enforce parallel computation of each individual subset before the union? I know Scala has a Future propery, though I never used it.
Something like>
def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DafaFrame] = {
Future { df1.union(df2).union(df2) }
}
// At the end
val finalDf = unionAllDataframes(
aTransformationOnDf1,
anotherTransformationOnDf1WithDf2,
yetAnotherTransformationOnDf1WithDf2)
finalDf.onComplete({
case Success(df) => df.write(etc...)
case Failure(exception) => handleException(exception)
})
Sorry for the horrendous design and probably the wrong usage of Future. Once again, I am a bit confused on this scenario and I am trying to micro-optimize this passage (if possible).
Thanks a lot in advance!
Cheers

Consuming exactly one file per micro batch in spark stream

I am new to spark streaming technology and doing a POC to migrate one of our ETL application to spark from commercial ETL tool. We get many small files thorough out the day as INPUT and will do the ETL operation on that.
As part of POC, we are trying out spark streaming. Currently, we are struck in logic to generate Surrogate KEY values for each record and we identified ROW_NUMBER as the way to generate SK, but it doesn't work for spark stream, as it has to be windowed operation.
I am trying to use the record number column (sequence number for each record in side the files) to generate SK.
Consume exactly one file at each micro batch.
calculate Sk = Previous batch max record number + record number
In write stream phase -calculate and store the max record number value using "foreachbatch" option.
I have used "maxFilesPerTrigger" option to fetch one file, but it doesn't guarantee exactly one file per micro batch.
val smsfDF_stream = spark.readStream.option("sep", ",").schema(schema).option("maxFilesPerTrigger", 1).option("cleanSource","archive").option("sourceArchiveDir", "/edw_ab_data/spark_poc/archive").csv("/edw_ab_data/spark_poc/input")
Also, I see max value for each batchdf, I see '9999' as value
val fin_query = resinterdf2.writeStream.format("csv").foreachBatch( (batchDF: DataFrame, batchId: Long ) => {
batchDF.createOrReplaceTempView("batchdf_tab")
val file_name = "max_file_"+batchId+".dat"
val max_rec_id = batchDF.sparkSession.sql("""select max(rec_number) from batchdf_tab""")
max_rec_id.write.format("csv").mode("overwrite").save("/edw_ab_data/spark_poc/max_val.dat")}).option("checkpointLocation", "/edw_ab_data/spark_poc").outputMode("append").option("path", "/edw_ab_data/spark_poc/output").start()

How to do df.rdd or df.collect().foreach on streaming dataset?

This is the exception I am getting whenever I am trying to convert it.
val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
All I am trying to do is replicate the following sql.dataframe operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
collect is a big no-no even in Spark Core's RDD world due to the size of the data you may transfer back to the driver's single JVM. It just sets the boundary of the benefits of Spark as after collect you are in a single JVM.
With that said, think about unbounded data, i.e. a data stream, that will never terminate. That's Spark Structured Streaming.
A streaming Dataset is one that is never complete and the data inside varies every time you ask for the content, i.e. the result of executing the structured query over a stream of data.
You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.
That's why you cannot collect on a streaming dataset. It is not possible up to Spark 2.2.1 (the latest version at the time of this writing).
If you want to receive the data that is inside a streaming dataset for a period of time (aka batch interval in Spark Streaming or trigger in Spark Structured Streaming) you write the result to a streaming sink, e.g. console.
You can also write your custom streaming sink that does collect.map(_.toSeq) inside addBatch which is the main and only method of a streaming sink. As a matter of fact, console sink does exactly it.
All I am trying to do is replicate the following sql.dataframe
operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
The very first solution that comes to my mind is to use foreach sink:
The foreach operation allows arbitrary operations to be computed on the output data.
That, of course, does not mean that this is the best solution. Just one that comes to my mind immediately.

Unbounded table is spark structured streaming

I'm starting to learn Spark and am having a difficult time understanding the rationality behind Structured Streaming in Spark. Structured streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as new row in the table. I have the following piece of code to read in incoming files to the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1GB file to the folder. As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file. Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In Structured Streaming Model, this is how the execution of this query is performed.
Question : If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file. Or does it automatically
batch it? If yes, is this batching parameter configurable?
Answer : There is no point of OOM since it is RDD(DF/DS)lazily initialized. of
course you need to re-partition before processing to ensure equal
number of partitions and data spread across executors uniformly...

Spark streaming: Cache DStream results across batches

Using Spark streaming (1.6) I have a filestream for reading lookup data with 2s of batch size, however files are copyied to the directory only every hour.
Once there's a new file, its content is read by the stream, this is what I want to cache into memory and keep there
until new files are read.
There's another stream to which I want to join this dataset therefore I'd like to cache.
This is a follow-up question of Batch lookup data for Spark streaming.
The answer does work fine with updateStateByKey however I don't know how to deal with cases when a KV pair is
deleted from the lookup files, as the Sequence of values in updateStateByKey keeps growing.
Also any hint how to do this with mapWithState would be great.
This is what I tried so far, but the data doesn't seem to be persisted:
val dictionaryStream = ssc.textFileStream("/my/dir")
dictionaryStream.foreachRDD{x =>
if (!x.partitions.isEmpty) {
x.unpersist(true)
x.persist()
}
}
DStreams can be persisted directly using persist method which persist every RDD in the stream:
dictionaryStream.persist
According to the official documentation this applied automatically for
window-based operations like reduceByWindow and reduceByKeyAndWindow and state-based operations like updateStateByKey
so there should be no need for explicit caching in your case. Also there is no need for manual unpersisting. To quote the docs once again:
by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared
and a retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState you'll have to provide a StateSpec. A minimal example requires a functions which takes key, Option of current value and previous state. Lets say you have DStream[(String, Long)] and you want to record maximum value so far:
val state = StateSpec.function(
(key: String, current: Option[Double], state: State[Double]) => {
val max = Math.max(
current.getOrElse(Double.MinValue),
state.getOption.getOrElse(Double.MinValue)
)
state.update(max)
(key, max)
}
)
val inputStream: DStream[(String, Double)] = ???
inputStream.mapWithState(state).print()
It is also possible to provide initial state, timeout interval and capture current batch time. The last two can be used to implement removal strategy for the keys which haven't been update for some period of time.