How to ensure outer NULL join results are output in Spark streaming if future events are delayed - scala

In a scenario of Spark stream-stream outer join:
val left = spark.readStream.format("delta").load("...")
  .withWatermark("enqueuedTime", "1 hour")
val right = spark.readStream.format("delta").load("...")
  .withWatermark("enqueuedTime", "1 hour")
val res = left.as("left").join(right.as("right"),
  expr("left.key = right.key AND (left.enqueuedTime BETWEEN right.enqueuedTime - INTERVAL 1 hour AND right.enqueuedTime + INTERVAL 1 hour)"),
  "left_outer")
res.writeStream(....)
And given data in the left and right streams, how can I ensure that a record like:
2, left_value1, 2022-04-18T12:39:49.370+0000, NULL, NULL, NULL
is output after a given period of time, even if new events aren't flowing through the stream?
I'm only able to get it if new events arrive in both tables, e.g.:
INSERT into left_df VALUES ("004", "left_df_value", current_timestamp() + INTERVAL 5 hours);
INSERT into right_df VALUES ("004", "right_df_value", current_timestamp() + INTERVAL 5 hours);
With those inserts, Spark advances the watermarks and understands that it is now safe to output the nullable record. But how can it still be output after some kind of timeout, without new records arriving in both streams?
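There is no built-in join-level timeout here: the outer-side NULLs are only emitted once the watermarks of both inputs move past the join range, and a watermark only advances when new events arrive on that input. One commonly suggested workaround is to union a low-volume synthetic "heartbeat" stream into each input so the watermarks keep advancing even while the real sources are idle. A minimal sketch, assuming leftRaw/rightRaw are the Delta streams before withWatermark and that the real schema is (key, value, enqueuedTime) — all of these names are illustrative:
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical heartbeat: roughly one dummy row per minute, carrying the current
// timestamp and a sentinel key that will never match real data.
val heartbeat = spark.readStream.format("rate")
  .option("rowsPerSecond", 1)
  .load()
  .filter($"value" % 60 === 0)
  .select(
    lit("__heartbeat__").as("key"),
    lit(null).cast("string").as("value"),
    $"timestamp".as("enqueuedTime"))

// Union the heartbeat into each side before applying the watermark, so the watermark
// keeps moving even when the Delta sources receive no new data.
val leftWithHeartbeat  = leftRaw.unionByName(heartbeat).withWatermark("enqueuedTime", "1 hour")
val rightWithHeartbeat = rightRaw.unionByName(heartbeat).withWatermark("enqueuedTime", "1 hour")
These would take the place of left and right in the join above, and rows with the sentinel key would be filtered out of the result downstream.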

Related

Spark Structured Streaming: differentiate over time

I have a table with a timestamp column (t) and a list of columns for which I would like to compute the difference over time (v), grouped by some key (k): v_diff(t) = v(t) - v(t-1) for each k independently.
Normally I would write:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# partition by the key column and order by time, so the lag is computed per key
lag_window = Window.partitionBy('k').orderBy('timestamp')
for col in COLS_TO_DIFF:
    df = df.withColumn(
        col + "_diff",
        df[col] - F.lag(df[col]).over(lag_window))
Yet for streaming I get:
AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets;
How do I get around that?
Note: my data is streaming slowly in batches
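One pattern that sidesteps the non-time-window restriction is to keep the previous value per key in streaming state and emit the difference as each row arrives, e.g. with flatMapGroupsWithState. A minimal Scala sketch, where Reading, Diff, the field names k/timestamp/v and the readings Dataset are all assumptions rather than the asker's actual schema:
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class Reading(k: String, timestamp: Timestamp, v: Double)
case class Diff(k: String, timestamp: Timestamp, vDiff: Double)

// readings: Dataset[Reading] built from the streaming source (assumed)
val diffs = readings
  .groupByKey(_.k)
  .flatMapGroupsWithState[Double, Diff](OutputMode.Append(), GroupStateTimeout.NoTimeout) {
    (key: String, rows: Iterator[Reading], state: GroupState[Double]) =>
      var prev: Option[Double] = state.getOption          // last value seen for this key
      val out = rows.toSeq.sortBy(_.timestamp.getTime).flatMap { r =>
        val d = prev.map(p => Diff(key, r.timestamp, r.v - p))
        prev = Some(r.v)
        d
      }
      prev.foreach(state.update)                          // carry the last value to the next micro-batch
      out.iterator
  }
This handles one value column; for several columns the state could hold a case class of previous values instead of a single Double.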

Pyspark write stream one column at a time

The source .csv has 414 columns, each for a different date:
The count in each date column is the cumulative total of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream that increments as the total deaths to date increase, iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results, displaying the total by state as it increments with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns and I have changed all of the day columns to IntegerType, keeping only the State and County columns as strings.
from pyspark.sql import functions as F
for col in temp_df.columns:
    temp_df = temp_df.withColumn(
        col,
        F.col(col).cast("integer")
    )
and then
from pyspark.sql.functions import col
temp_df = temp_df.withColumn("County Name", col("County Name").cast('integer')) \
                 .withColumn("State", col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult: streaming in the results one column at a time.
Or can I simply PIVOT? If yes, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)
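For the day-by-day display, what may actually be needed is the reverse reshaping: melting the wide per-date columns into long (day, deaths) rows and aggregating by state. A hedged Scala sketch of that unpivot, where the date-column regex and all names are assumptions about the USAFacts layout:
import org.apache.spark.sql.functions._

// Collect the columns whose names look like dates (e.g. "1/22/20") - an assumption.
val dayCols   = df.columns.filter(_.matches("""\d{1,2}/\d{1,2}/\d{2}"""))
val stackExpr = dayCols.map(c => s"'$c', `$c`").mkString(", ")

// stack() turns one wide row into one row per (day, deaths) pair.
val longDF  = df.selectExpr("State", s"stack(${dayCols.length}, $stackExpr) as (day, deaths)")
val byState = longDF.groupBy("State", "day").agg(sum("deaths").alias("total_deaths"))
Note also that pivot on a grouped DataFrame needs an aggregation (e.g. .sum(...)) after it before it returns a DataFrame.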

Structured Spark streaming leftOuter join behaves like inner join

I'm trying a Structured Streaming stream-stream join, and my left outer join behaves exactly the same as an inner join.
Using spark version 2.4.2 and Scala version 2.12.8, Eclipse OpenJ9 VM, 1.8.0_252
Here is what I'm trying to do:
Create a rate stream which generates 1 row per second.
Create an Employee and a Dept stream out of it.
The Employee stream's departmentId field is the rate value multiplied by 2, and the Dept stream's Id field is the rate value multiplied by 3.
The purpose of this is to have two streams with some common and some non-common id values.
Do a leftOuter stream-stream join with a time constraint of 30 seconds, with the dept stream on the left side of the join.
Expectation:
After the 30-second time constraint passes, for unmatched rows I should see nulls on the right side of the join.
What's happening:
I only see rows where there was a match between the ids, and no unmatched rows.
Code - trying on spark-shell
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
case class RateData(timestamp: Timestamp, value: Long)
// create rate source with 1 row per second.
val rateSource = spark.readStream.format("rate").option("rowsPerSecond", 1).option("numPartitions", 1).option("rampUpTime", 1).load()
import spark.implicits._
val rateSourceData = rateSource.as[RateData]
// employee stream: departmentId is 2 times the rate value
val employeeStreamDS = rateSourceData
  .withColumn("firstName", concat(lit("firstName"), rateSourceData.col("value") * 2))
  .withColumn("departmentId", lit(floor(rateSourceData.col("value") * 2)))
  .withColumnRenamed("timestamp", "empTimestamp").withWatermark("empTimestamp", "10 seconds")
// dept stream: Id is 3 times the rate value
val departmentStreamDS = rateSourceData
  .withColumn("name", concat(lit("name"), floor(rateSourceData.col("value") * 3)))
  .withColumn("Id", lit(floor(rateSourceData.col("value") * 3)))
  .drop("value").withColumnRenamed("timestamp", "depTimestamp")
// watermark of 10 seconds on the employee stream; join time constraint is 30 seconds.
val joinedDS = departmentStreamDS.join(employeeStreamDS,
  expr(""" id = departmentId AND empTimestamp >= depTimestamp AND empTimestamp <= depTimestamp + interval 30 seconds """),
  "leftOuter")
val q = joinedDS.writeStream.format("parquet").trigger(Trigger.ProcessingTime("60 seconds"))
  .option("checkpointLocation", "checkpoint").option("path", "rate-output").start
I queried the output table after 10 minutes and found only 31 matching rows, which is the same as the inner-join output.
val df = spark.read.parquet("rate-output")
df.count
res0: Long = 31
df.agg(min("departmentId"), max("departmentId")).show
+-----------------+-----------------+
|min(departmentId)|max(departmentId)|
+-----------------+-----------------+
|                0|              180|
+-----------------+-----------------+
Explanation of the output:
In the employeeStreamDS stream, the departmentId field is 2 times the rate value, so it contains multiples of 2.
In the departmentStreamDS stream, the Id field is 3 times the rate value, so it contains multiples of 3.
So departmentId = Id matches at every multiple of 6, because LCM(2, 3) = 6.
Those matches happen as long as the two streams are within the 30-second join time constraint of each other.
I would expect that, after 30 seconds, I would see null values for the dept stream ids that never match (3, 9, 15, ...), and so on.
I hope I'm explaining it well enough.
So the resulting question is about the left-outer join behavior in Spark streaming.
From my understanding, and indeed according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins, you need to apply watermarks on the event-time columns of both streams, e.g.:
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")
...
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """),
  joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
)
You have only one watermark defined, on the employee stream; the department stream, the outer side of the join, has none.
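A sketch of the fix, based on the code above: define a watermark on the department stream as well, so Spark can decide when an unmatched department row is final and emit it with nulls on the right.
val departmentStreamDS = rateSourceData
  .withColumn("name", concat(lit("name"), floor(rateSourceData.col("value") * 3)))
  .withColumn("Id", floor(rateSourceData.col("value") * 3))
  .drop("value")
  .withColumnRenamed("timestamp", "depTimestamp")
  .withWatermark("depTimestamp", "10 seconds") // watermark on the outer side as well
With watermarks on both inputs plus the existing 30-second range condition, the outer NULL results should appear once the watermarks move past the range, though (as in the first question above) only after newer data arrives to advance them.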

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code, converted from a SQL script. It has been running for two hours and is still running, even slower than SQL Server. Is anything not done correctly?
The following is the plan:
Push table2 to all executors.
Partition table1 and distribute the partitions to the executors.
Then each row in table2/t2 is joined (cross join) with each partition of table1.
So the calculation on the result of the cross join can be run distributed/in parallel. (For example, suppose I have 16 executors: keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.)
case class Cols(Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
                F6: String, F7: BigDecimal, F8: String, F9: String, F10: String)
case class Result(Id1: Int, ID2: Int, Point: Int)

def getDataFromDB(source: String) = {
  import sqlContext.sparkSession.implicits._
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$source"
  )).load()
    .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
    .as[Cols]
}
val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
  val (l, r) = x
  Result(l.Id, r.Id,
    (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 3 else 0)
    + (if (l.F3 != null && r.F3 != null && l.F3 == r.F3) 2 else 0)
    + ..... // all kinds of similar expressions
    + (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)
  )
}).filter(x => x.Point >= 10)
println(s"Total count ${j.count()}") // This takes forever; the count will be about 100
How do I rewrite this in an idiomatic Spark way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database over JDBC, and despite the cache it does not benefit from it.
Start by checking the physical plan and the Executors tab in the web UI to confirm that a single executor with a single task is doing the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use partitionColumn, lowerBound, upperBound options for the JDBC data source
Use predicates option
See JDBC To Other Databases in Spark's official documentation; a sketch of the first option follows.
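A hedged sketch of a partitioned JDBC read; the choice of Id as the partition column and the bounds are assumptions and should come from the actual data:
// Split the JDBC read into parallel tasks by striding over a numeric column.
// Spark issues one query per partition with a WHERE clause on partitionColumn.
val table1 = sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> "table1",
    "partitionColumn" -> "Id",    // assumed numeric and roughly uniformly distributed
    "lowerBound" -> "1",
    "upperBound" -> "300000",     // about 300K rows, per the question
    "numPartitions" -> "32"
  )).load()
  .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
  .as[Cols]
Note that lowerBound and upperBound only control the partition stride; they do not filter rows.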
Once you're happy with the loading, you should work on improving the last count action and add another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The reason the entire query is slow is that you only mark table1 to be cached, and the caching is only triggered when an action gets executed, which happens at the very end (!). In other words, cache does nothing useful here and, more importantly, adds overhead that makes the query performance even worse.
Performance will also increase after you do table2.cache.count.
If you want to do a cross join, use the crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
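For instance, a minimal sketch (the broadcast hint discussed below can be combined with it):
// Explicit cartesian product of the two Datasets (~300K x ~20K rows); the result is an
// untyped DataFrame, with aliases to disambiguate the identical column names.
val crossed = table1.as("l").crossJoin(table2.as("r"))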
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross join can be run distributed/in parallel.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
For example, suppose I have 16 executors: keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.
Use the broadcast function to hint to Spark SQL's engine that table2 should be used in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
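Putting it together, a hedged rewrite of the join and scoring (keeping the original Cols and Result case classes, with the scoring expressions abbreviated):
import org.apache.spark.sql.functions.{broadcast, lit}
import sqlContext.sparkSession.implicits._

// Broadcast table2 to every executor and let Spark plan the cartesian product itself,
// instead of wrapping the Dataset in sc.broadcast.
val scored = table1.joinWith(broadcast(table2), lit(true))
  .map { case (l, r) =>
    Result(l.Id, r.Id,
      (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 3 else 0) +
      (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)) // ...other fields likewise
  }
  .filter(_.Point >= 10)

scored.count() // run the action on the scored result, not on the unscored join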

Flink: Table API copy operators in execution plan

I use the Flink 1.2.0 Table API for processing some streaming data. The following is my code:
val dataTable = myDataStream
// table A
val tableA = dataTable
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy("w, group1, group2")
.select("w.start as time, group1, group2, data1.sum as data1, data2.sum as data2")
tableEnv.registerTable("tableA", tableA)
// table A sink
tableA.writeToSink(sinkTableA)
//...
// I should get some other outputs from the TableA output
//...
val registeredTableA = tableEnv.ingest("tableA")
// table result1
val result1 = registeredTableA
  .window(Tumble over 5.minutes on 'rowtime as 'w)
  .groupBy("w, group1")
  .select("w.start as time, group1, data1.sum as data1")
// result1 sink
result1.writeToSink(sinkResult1)
// table result2
val result2 = registeredTableA
  .window(Tumble over 5.minutes on 'rowtime as 'w)
  .groupBy("w, group2")
  .select("w.start as time, group2, data2.sum as data2")
// result2 sink
result2.writeToSink(sinkResult2)
I expected to get this tree in the Flink execution plan, the same as I have for Flink Streaming in my other Flink jobs:
DataStream_Operators -> TableA_Operators -> TableA_Sink
                                         |-> Result1_Operators -> Result1_Sink
                                         |-> Result2_Operators -> Result2_Sink
But I get this, with 3 copies of the same operators for TableA!
DataStream_Operators -> TableA_Operators -> TableA_Sink
                     |-> Copy_of_TableA_Operators -> Result1_Operators -> Result1_Sink
                     |-> Copy_of_TableA_Operators -> Result2_Operators -> Result2_Sink
As a result, I get bad performance for this job with big input data.
How can I fix this and get an optimal execution plan?
I understand that the Flink Table API and SQL are experimental features, and maybe this will be fixed in future versions.
In its current state, the Table API translates the whole query whenever you convert a Table into a DataSet or DataStream or write it to a TableSink. In your program you call writeToSink three times, which means that the complete query is translated each time.
But what is the complete query? It is all the Table API operators that have been applied to a Table. When you register a Table in the TableEnvironment, it is basically registered as a view, i.e., only its definition (all the operators that define the Table) is registered. Therefore, those operators are translated again when you call writeToSink the second and third time.
You can solve this issue by translating tableA into a DataStream and registering the DataStream in the TableEnvironment instead of registering it as a Table. This would look as follows:
val tableA = ...
val streamA = tableA.toDataStream[X] // X should be a case class for the rows of tableA
tableEnv.registerDataStream("tableA", streamA)
tableEnv.ingest("tableA").writeToSink(sinkTableA) // emit tableA by ingesting the registered DataStream
I know this is not very convenient, but at the moment it is the only way to avoid the repeated translation of a Table.