In Spark structured streaming how do I output complete aggregations to an external source like a REST service - scala

The task I am trying to perform is to aggregate the count of values from a dimension (field) in a DataFrame, compute some statistics like average, max, min, etc., and then output the aggregates to an external system by making an API call. I am using a watermark of, say, 30 seconds with a window size of 10 seconds. I made these sizes small to make it easier for me to test and debug the system.
The only method I have found for making API calls is to use a ForeachWriter. My problem is that the ForeachWriter executes at the partition level and only produces an aggregate per partition. So far I haven't found a way to get the rolled-up aggregates other than to coalesce to 1, which is way too slow for my streaming application.
I have found that if I use a file-based sink, such as the Parquet writer to HDFS, the code produces real aggregations. It also performs very well. What I really need is to achieve the same result but by calling an API rather than writing to a file system.
Does anyone know how to do this?
I have tried this with Spark 2.2.2 and Spark 2.3 and get the same behavior.
Here is a simplified code fragment to illustrate what I am trying to do:
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select(
    $"window.start".as("windowStart"),
    $"window.end".as("windowEnd"),
    $"assetClass".as("metric"),
    $"eventCount")
  .as[DimAggregateRecord]
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(/* --- this is my DimAggregateRecord ForeachWriter --- */)
}).start()

I answered my own question. I found that repartitioning by the window start time did the trick. It shuffles the data so that all rows with the same group and windowStart time end up on the same executor. The code below produces a file for each group window interval. It also performs quite well. I don't have exact numbers, but it produces the aggregates in less time than the window interval of 10 seconds.
val valStream = streamingDF
  .select(
    $"event.name".alias("eventName"),
    expr("event.clientTimestamp / 1000").cast("timestamp").as("eventTime"),
    $"asset.assetClass".alias("assetClass"))
  .where($"eventName" === "MyEvent")
  .withWatermark("eventTime", "30 seconds")
  .groupBy(window($"eventTime", "10 seconds"), $"assetClass", $"eventName")
  .agg(count($"eventName").as("eventCount"))
  .select(
    $"window.start".as("windowStart"),
    $"window.end".as("windowEnd"),
    $"assetClass".as("metric"),
    $"eventCount")
  .as[DimAggregateRecord]
  .repartition($"windowStart") // <-------- this line produces the desired result
  .writeStream
  .option("checkpointLocation", config.checkpointPath)
  .outputMode(config.outputMode)

val session = (if (config.writeStreamType == AbacusStreamWriterFactory.S3) {
  valStream.format(config.outputFormat)
    .option("path", config.outputPath)
} else {
  valStream.foreach(/* --- this is my DimAggregateRecord ForeachWriter --- */)
}).start()
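For reference, here is a minimal sketch of what the elided DimAggregateRecord ForeachWriter could look like when posting to a REST service. This is an illustrative assumption only: the endpoint URL, the JSON payload shape, and the use of HttpURLConnection are not from the original post. Because of repartition($"windowStart"), all rows with the same windowStart land in the same partition, so the rolled-up aggregates for a window are never split across API calls.

import java.net.{HttpURLConnection, URL}
import org.apache.spark.sql.ForeachWriter
import scala.collection.mutable.ArrayBuffer

// Hypothetical writer: buffers the partition's records and posts them to a
// REST endpoint in a single request when the partition finishes.
class RestDimAggregateWriter(endpoint: String) extends ForeachWriter[DimAggregateRecord] {
  private val buffer = ArrayBuffer.empty[DimAggregateRecord]

  override def open(partitionId: Long, version: Long): Boolean = {
    buffer.clear()
    true
  }

  override def process(record: DimAggregateRecord): Unit = buffer += record

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null && buffer.nonEmpty) {
      val payload = buffer.map { r =>
        s"""{"windowStart":"${r.windowStart}","windowEnd":"${r.windowEnd}","metric":"${r.metric}","eventCount":${r.eventCount}}"""
      }.mkString("[", ",", "]")
      val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      conn.getOutputStream.write(payload.getBytes("UTF-8"))
      conn.getResponseCode // force the request to complete
      conn.disconnect()
    }
  }
}

// Usage in place of the placeholder above (URL is hypothetical):
// valStream.foreach(new RestDimAggregateWriter("https://example.com/aggregates"))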

Related

Spark sliding window is not triggering when there is no data from the data stream

I have a simple Spark SQL query with a time window of 5 minutes and a trigger policy of every minute:
val withTime = eventStreams(0)
  .selectExpr("*", "cast(cast(parsed.time as long)/1000 as timestamp) as event_time")

val momentumDataAggQuery = withTime
  .selectExpr("parsed.symbol", "parsed.bid", "parsed.ask", "event_time")
  .withWatermark("event_time", "1 minutes")
  .groupBy(col("symbol"), window(col("event_time"), "5 minutes", "60 seconds")) // change to 60 minutes
  .agg(first("bid", true).as("first_bid"), first("ask").as("first_ask"), last("bid").as("last_bid"), last("ask").as("last_ask"))

val momentumDataQuery = momentumDataAggQuery
  .selectExpr("window.start", "window.end", "ln(((last_bid + last_ask)/2)/((first_bid + first_ask)/2)) as momentum", "symbol")
When there is data from the stream, it gets triggered every minute to calculate the 'momentum', but it stops when there are no new data points. I expected it to keep updating every minute using the old data, even when there are not enough new data points.
Consider the example in the following table.
In the 1st window, there is only one data point, so the log return is zero.
In the 2nd window, there are only two data points, so it takes log(97.5625/97.4625), where 97.5625 was received at 11:53 and 97.4625 was received at 11:52:10, within the time window 12:19 <> 12:54... It went on calculating the log return whenever there were sufficient data points.
However, when there were NO more data points after 15:56:12, say for the window 12:54 <> 12:59, I expected it to take ln(97.8625/97.6625), where the inputs were generated at 11:56:12 and 11:54:11 respectively. That was not the case; the red box was never generated.
Is there something wrong with my Spark SQL?

Spark flushing Dataframe on show / count

I am trying to print the count of a dataframe, and then the first few rows of it, before finally sending it out for further processing.
Strangely, after a call to count() the dataframe becomes empty.
val modifiedDF = funcA(sparkDF)
val deltaDF = modifiedDF.except(sparkDF)
println(deltaDF.count()) // prints 10
println(deltaDF.count()) //prints 0, similar behavior with show
funcB(deltaDF) //gets null dataframe
I was able to verify the same using deltaDF.collect.foreach(println) and subsequent calls to count.
However, if I do not call count or show, and just send it as is, funcB gets the whole DF with 10 rows.
Is it expected?
Definition of funcA() and its dependencies:
def funcA(inputDataframe: DataFrame): DataFrame = {
  val col_name = "colA"
  val modified_df = inputDataframe.withColumn(col_name, customUDF(col(col_name)))
  val modifiedDFRaw = modified_df.limit(10)
  modifiedDFRaw.withColumn("colA", modifiedDFRaw.col("colA").cast("decimal(38,10)"))
}

val customUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](myUDF)

def myUDF(sval: java.math.BigDecimal): Option[java.math.BigDecimal] = {
  val strg_name = Option(sval).getOrElse(return None)
  if (change_cnt < 20) {
    change_cnt = change_cnt + 1
    Some(strg_name.multiply(new java.math.BigDecimal("1000")))
  } else {
    Some(strg_name)
  }
}
First of all, a function used as a UserDefinedFunction has to be at least idempotent, and optimally pure. Otherwise the results are simply non-deterministic. While some escape hatch is provided in the latest versions (it is possible to hint to Spark that a function shouldn't be re-executed), that won't help you here.
Moreover, having mutable state (it is not exactly clear what the source of change_cnt is, but it is both written and read in the UDF) is simply a no-go - Spark doesn't provide global mutable state.
Overall your code:
Modifies some local copy of some object.
Makes decisions based on such an object.
Unfortunately both components are simply not salvageable. You'll have to go back to the planning phase and rethink your design.
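To illustrate the point, here is a minimal sketch of a pure UDF (output depends only on the input, so Spark may re-execute it any number of times without changing the result), reusing the column and dataframe names from the question; note that the original "first 20 rows" behaviour cannot be expressed this way, precisely because it relies on external mutable state:

import org.apache.spark.sql.functions.{col, udf}

// Pure: no external state is read or written, so re-execution is harmless.
def scaleBy1000(sval: java.math.BigDecimal): Option[java.math.BigDecimal] =
  Option(sval).map(_.multiply(new java.math.BigDecimal("1000")))

val scaleUDF = udf[Option[java.math.BigDecimal], java.math.BigDecimal](scaleBy1000)

val scaledDF = inputDataframe.withColumn("colA", scaleUDF(col("colA")))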
Your DataFrame is a distributed dataset, and trying to do a count() returns unpredictable results here, since the count() can be different on each node. Read the documentation about RDDs below; it is applicable to DataFrames as well.
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#understanding-closures-
https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#printing-elements-of-an-rdd
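Those links describe the closure-capture problem; the canonical example from the guide, as a minimal sketch (assuming a SparkContext named sc), shows why driver-side mutable state does not survive distribution and what the supported alternative looks like:

// A driver-side variable captured in a closure is serialized and shipped to
// each executor; the executors mutate their own copies, not the driver's.
var counter = 0
val data = sc.parallelize(1 to 10)

data.foreach(x => counter += x)
println("Counter value: " + counter) // may still print 0 in cluster mode

// The supported way to aggregate a side effect like this is an accumulator.
val acc = sc.longAccumulator("sum")
data.foreach(x => acc.add(x))
println("Accumulator value: " + acc.value) // 55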

How the number of Tasks and Partitions is set when using MemoryStream?

I'm trying to understand a strange behavior that I observed in my Spark Structured Streaming application, which is running in local[*] mode.
I have 8 cores on my machine. While the majority of my batches have 8 partitions, every once in a while I get 16, 32, 56, and so on partitions/tasks. I notice that it is always a multiple of 8. I have noticed, on opening the Stages tab, that when it happens it is because there are multiple LocalTableScans.
That is, if I have 2 LocalTableScans then the mini-batch job will have 16 tasks/partitions, and so on.
I mean, it could well do the two scans, combine the two batches and feed that to the mini-batch job. However, no: it results in a mini-batch job where the number of tasks = number of cores * number of scans.
Here is how I set my MemoryStream:
val rows = MemoryStream[Map[String,String]]
val df = rows.toDF()
val rdf = df.mapPartitions{ it => {.....}}(RowEncoder.apply(StructType(List(StructField("blob", StringType, false)))))
I have a future that feeds my memory stream as such, right after:
Future {
  blocking {
    for (i <- 1 to 100000) {
      rows.addData(maps)
      Thread.sleep(3000)
    }
  }
}
and then my query:
rdf.writeStream
  .trigger(Trigger.ProcessingTime("1 seconds"))
  .format("console")
  .outputMode("append")
  .queryName("SourceConvertor1")
  .start()
  .awaitTermination()
I wonder why the number of tasks varies. How is it supposed to be determined by Spark?

spark streaming - use previously calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs an SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take the previous calculation (from the earlier batch) into account in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
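As a minimal sketch of that suggestion, applied to the DStream from the question (the explicit storage level is an illustrative assumption):

import org.apache.spark.storage.StorageLevel

// Cache each batch RDD of the stream so repeated actions within a batch reuse
// it instead of recomputing it; Spark clears cached RDDs automatically once
// they fall outside the stream's remember duration.
dstream.persist(StorageLevel.MEMORY_ONLY)

dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}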
If you are looking only for one or two previously calculated dataframes, you should look into Spark Streaming Window.
The snippet below is from the Spark documentation.
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to do a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the reduceByKeyAndWindow operation.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Reading Mongo data from Spark

I am reading data from MongoDB in a Spark job, using the com.mongodb.spark.sql connector (v2.0.0).
It works fine for most databases, but for one specific database the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing that we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are pushed down to the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/).
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)

val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()

println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions

val filteredDF = df.coalesce(128).filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)

println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce

filteredDF.select(
  $"iid",
  $"instance_id".as("instanceId"),
  $"_global_visitor_key".as("globalVisitorKey"),
  $"_timestamp".as("timestamp"),
  $"_timestamp".cast(DataTypes.DateType).as("date")
)
Data is not very big (Shuffle Write is 20MB for this stage) and even if I filter only 1 document, the run time is the same (only the Shuffle Write is much smaller).
How can I solve this?
Thanks
Nir