Why does Spark Structured Streaming window aggregation evaluate after each trigger - spark-structured-streaming

With Spark 2.2.0, I am reading data from Kafka with two columns, "textcol" and "time". The "time" column holds the latest processing time. I want to get the count of the unique values of "textcol" over a fixed window duration of 20 seconds. My trigger interval is 10 seconds.
For example, if within one 20-second window trigger 1 sees textcol=a and trigger 2 sees textcol=b, then I expect the following output after 20 seconds:
textcol cnt
a 1
b 1
I used the code below for dataset ds:
ds.groupBy(functions.col("textcol"),
           functions.window(functions.col("time"), "20 seconds"))
  .agg(functions.count("textcol").as("cnt"))
  .writeStream().trigger(Trigger.ProcessingTime("10 seconds"))
  .outputMode("update")
  .format("console").start();
But I am getting the output twice, due to the 2 triggers within the 20 seconds:
Trigger1:
textcol cnt
a 1
Trigger2:
textcol cnt
b 1
So why does the window not aggregate the results and output them once after 20 seconds, instead of emitting on every 10-second trigger?
Is there any other way to achieve this in Spark Structured Streaming?

Change your .outputMode("update") to .outputMode("complete").
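In complete mode the sink receives the entire updated result table on every trigger, so both counts for the same 20-second window appear together in one batch. A minimal PySpark sketch of the suggested change, assuming the same "textcol"/"time" columns; the Kafka connection options and the value parsing are placeholders, not taken from the question:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder Kafka source producing the two columns described above.
ds = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   # placeholder
      .option("subscribe", "topic")                     # placeholder
      .load()
      .selectExpr("CAST(value AS STRING) AS textcol", "timestamp AS time"))

counts = (ds.groupBy(F.col("textcol"),
                     F.window(F.col("time"), "20 seconds"))
            .agg(F.count("textcol").alias("cnt")))

query = (counts.writeStream
         .trigger(processingTime="10 seconds")
         .outputMode("complete")    # full result table on every trigger
         .format("console")
         .start())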

Related

Pyspark write stream one column at a time

The source .csv has 414 columns, each with a new date:
The count in each date column is the cumulative total of COVID deaths up to that date.
I want to display in a Databricks dashboard a stream which increments as the total deaths to date increases, iterating through the date columns from left to right for 412 days. I will insert a pause on the stream after each day, then ingest the next day's results, displaying the total by state as it increments with each day.
So far:
df = spark.read.option("header", "true").csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv")
This initial df has 418 columns, and I have changed all of the day columns to IntegerType, keeping only the State and County columns as strings.
from pyspark.sql import functions as F

for col in temp_df.columns:
    temp_df = temp_df.withColumn(
        col,
        F.col(col).cast("integer")
    )
and then
from pyspark.sql.functions import col

temp_df.withColumn("County Name", col("County Name").cast('integer')) \
       .withColumn("State", col("State").cast('integer'))
Then I use df.schema to get the schema and do a second ingest of the .csv, this time with the schema defined. But my next challenge is the most difficult: to stream in the results one column at a time.
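For reference, a minimal sketch of that schema-based second ingest, assuming the schema is captured from the already-cast DataFrame above (the path is the one used earlier):

schema = temp_df.schema  # column types captured from the first, already-cast pass
df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/databricks-datasets/COVID/USAFacts/covid_deaths_usafacts.csv"))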
Or can I simply PIVOT? If so, then like this?
pivotDF = df.groupBy("State").pivot("County", countyFIPS)
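One note on that last line: in Spark, groupBy().pivot() has to be followed by an aggregation before it yields a DataFrame. A hedged sketch of what that could look like; the "County Name" column name, the countyFIPS list of expected values, and the choice of summing a single (placeholder) day column are assumptions, not from the question:

# Hypothetical pivot: one row per State, one column per county,
# aggregating one chosen day column. pivot() alone is not enough;
# an aggregation such as sum()/count()/first() completes it.
pivotDF = (df.groupBy("State")
             .pivot("County Name", countyFIPS)   # countyFIPS: optional list of expected values
             .sum("2020-05-28"))                 # placeholder day column name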

How to ensure outer join NULL results are output in Spark streaming if future events are delayed

In a Spark stream-stream outer join scenario:
val left = spark.readStream.format("delta").load("...")
  .withWatermark("enqueuedTime", "1 hour")
val right = spark.readStream.format("delta").load("...")
  .withWatermark("enqueuedTime", "1 hour")
val res = left.as("left").join(right.as("right"),
  expr("left.key = right.key AND (left.enqueuedTime BETWEEN right.enqueuedTime - INTERVAL 1 hour AND right.enqueuedTime + INTERVAL 1 hour)"),
  "left_outer")
res.writeStream(....)
And given data in the left and right streams:
How do I ensure that a record like:
2, left_value1, 2022-04-18T12:39:49.370+0000, NULL, NULL, NULL
is output after a given period of time, even if new events aren't flowing through the stream?
I'm only able to get it if new events arrive in both tables, like:
INSERT into left_df VALUES ("004", "left_df_value", current_timestamp() + INTERVAL 5 hours);
INSERT into right_df VALUES ("004", "right_df_value", current_timestamp() + INTERVAL 5 hours);
With these, Spark advances the watermarks and understands that it is now safe to output the null-padded record. But how can I still output it after some kind of timeout, without new records arriving in both streams?

Spark optimization - joins - very low number of tasks - OOM

My Spark application fails with this error: Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
This is what I get when I inspect the container log: java.lang.OutOfMemoryError: Java heap space
My application mainly takes a table and then joins it with different tables that I read from AWS S3:
var result = readParquet(table1)
val table2 = readParquet(table2)
result = result.join(table2 , result(primaryKey) === table2(foreignKey))
val table3 = readParquet(table3)
result = result.join(table3 , result(primaryKey) === table3(foreignKey))
val table4 = readParquet(table4)
result = result.join(table4 , result(primaryKey) === table4(foreignKey))
and so on
My application fails when I try to save my result DataFrame to PostgreSQL using:
result.toDF(df.columns.map(x => x.toLowerCase()): _*).write
.mode("overwrite")
.format("jdbc")
.option(JDBCOptions.JDBC_TABLE_NAME, table)
.save()
On my failed join stage I have a very low number of tasks: 6 tasks for 4 executors.
Why does my stage generate 2 jobs?
The first one completes with 426 tasks:
and the second one fails:
My spark-submit conf:
dynamicAllocation = true
num core = 2
driver memory = 6g
executor memory = 6g
max num executor = 10
min num executor = 1
spark.default.parallelism = 400
spark.sql.shuffle.partitions = 400
I tried with more resources, but I hit the same problem:
num core = 5
driver memory = 16g
executor memory = 16g
num executor = 20
I think that all the data goes to the same partition/executor even with a default of 400 partitions, and this causes an OOM error.
I tried (without success):
persisting the data
a broadcast join, but my table is not small enough to broadcast at the end
repartitioning to a higher number (4000) and doing a count between each join to force an action:
My main table seems to grow very fast:
(number of rows) 40 -> 68 -> 7304 -> 946,832 -> 123,032,864 -> 246,064,864 -> (takes too long after that)
However, the data size seems very low.
If I look at the task metrics, an interesting thing is that my data seems skewed (I am really not sure):
In the last count action, I can see that ~120 tasks perform the action, with ~10 MB of input data for 100 records in 12 seconds, and the other 3880 tasks do absolutely nothing (3 ms, 0 records, 16 B (metadata?)).
driver memory = 16g is too high and not needed. Use that much only when you pull a huge collection of data back to the driver with actions like collect(), and make sure to increase spark.driver.maxResultSize if that is the case.
You can do the following things (a rough sketch follows the list):
-- Repartition while reading the files: readParquet(table1).repartition(x). If one of the tables is small, you can broadcast it and drop the join; instead use mapPartitions with a broadcast variable as a lookup cache.
(OR)
-- Select a column that is uniformly distributed and repartition your table accordingly using that particular column.
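A rough PySpark sketch of those two suggestions (the question's code is Scala; the paths, key and value column names, partition count, and small lookup table are all placeholders, and the broadcast-as-lookup idea is shown with a UDF rather than mapPartitions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1) Repartition on the join key while reading, so rows spread across many
#    partitions instead of piling into a few huge ones.
result = (spark.read.parquet("s3://bucket/table1")         # placeholder path
               .repartition(400, "primary_key"))            # assumed key column
table2 = (spark.read.parquet("s3://bucket/table2")
               .repartition(400, "foreign_key"))
result = result.join(table2, result["primary_key"] == table2["foreign_key"])

# 2) If one table is small, avoid the join entirely: broadcast it to the
#    executors and look values up directly.
small = {r["foreign_key"]: r["some_value"]                  # placeholder columns
         for r in spark.read.parquet("s3://bucket/small_table").collect()}
lookup = spark.sparkContext.broadcast(small)
lookup_value = F.udf(lambda k: lookup.value.get(k))
result = result.withColumn("small_value", lookup_value(F.col("primary_key")))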
Two points I need to stress from the stats above: your job has a high scheduling delay, which is caused by too many tasks, and in your task stats a few tasks are launched with input data of 10 bytes and a few with 9 MB... obviously there is data skew here. As you said, the first job completed with 426 tasks, but with 4000 as the repartition count it should launch more tasks.
Please look at https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c for more insights.

Structured Spark streaming leftOuter join behaves like inner join

I'm trying a Structured Streaming stream-stream join, and my left outer join behaves exactly the same as an inner join.
Using Spark version 2.4.2, Scala version 2.12.8, Eclipse OpenJ9 VM, Java 1.8.0_252.
Here is what I'm trying to do:
Create a rate stream which generates 1 row per second.
Create Employee and Dept streams out of it.
The Employee stream's departmentId field multiplies the rate value by 2 and the Dept stream's id field multiplies it by 3.
The purpose of doing this is to have two streams with some common and some non-common id values.
Do a leftOuter stream-stream join with a time constraint of 30 seconds and the dept stream on the left side of the join.
Expectation:
After the 30-second time constraint, for unmatched rows, I should see nulls on the right side of the join.
What's happening:
I only see rows where there was a match between ids, and no unmatched rows.
Code - trying this in spark-shell:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

case class RateData(timestamp: Timestamp, value: Long)

// create rate source with 1 row per second.
val rateSource = spark.readStream.format("rate").option("rowsPerSecond", 1).option("numPartitions", 1).option("rampUpTime", 1).load()
import spark.implicits._
val rateSourceData = rateSource.as[RateData]

// employee stream: departmentId is the rate value multiplied by 2
val employeeStreamDS = rateSourceData
  .withColumn("firstName", concat(lit("firstName"), rateSourceData.col("value") * 2))
  .withColumn("departmentId", lit(floor(rateSourceData.col("value") * 2)))
  .withColumnRenamed("timestamp", "empTimestamp")
  .withWatermark("empTimestamp", "10 seconds")

// dept stream: Id is the rate value multiplied by 3
val departmentStreamDS = rateSourceData
  .withColumn("name", concat(lit("name"), floor(rateSourceData.col("value") * 3)))
  .withColumn("Id", lit(floor(rateSourceData.col("value") * 3)))
  .drop("value")
  .withColumnRenamed("timestamp", "depTimestamp")

// watermark - 10s on the employee stream; the join time constraint is 30 secs.
val joinedDS = departmentStreamDS.join(employeeStreamDS, expr(""" id = departmentId AND empTimestamp >= depTimestamp AND empTimestamp <= depTimestamp + interval 30 seconds """), "leftOuter")
val q = joinedDS.writeStream.format("parquet").trigger(Trigger.ProcessingTime("60 seconds")).option("checkpointLocation", "checkpoint").option("path", "rate-output").start
I queried the output table after 10 minutes and only found 31 matching rows, which is the same as the inner join output.
val df = spark.read.parquet("rate-output")
df.count
res0: Long = 31
df.agg(min("departmentId"), max("departmentId")).show
+-----------------+-----------------+
|min(departmentId)|max(departmentId)|
+-----------------+-----------------+
| 0| 180|
+-----------------+-----------------+
Explanation of the output:
In the employeeStreamDS stream, the departmentId field value is 2 times the rate value, so it contains multiples of two.
In the departmentStreamDS stream, the Id field is 3 times the rate stream value, so it contains multiples of 3.
So there is a match of departmentId = Id on every multiple of 6, because LCM(2,3) = 6.
That happens as long as the difference between those streams is within 30 seconds (the join time constraint).
I would expect that after 30 seconds I would get null values for the unmatched dept stream values (3, 9, 15, ...) and so on.
I hope I'm explaining it well enough.
So the resulting question is about left-outer join behavior for Spark streaming.
From my understanding, and indeed according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#stream-stream-joins, you need to apply watermarks on the event-time columns of both streams, e.g.:
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")
...
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """),
  joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
)
You have only one watermark defined.
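In the spark-shell code above only employeeStreamDS has .withWatermark; departmentStreamDS has none, so Spark can never decide that an unmatched department row is safe to emit with nulls. A minimal sketch of the two-watermark pattern, written in PySpark syntax with the column names from the question (the DataFrame variable names here are hypothetical):

from pyspark.sql.functions import expr

# Both inputs carry a watermark, so the engine can eventually close the
# 30-second window and emit unmatched department rows with null employee fields.
emp = employee_stream.withWatermark("empTimestamp", "10 seconds")
dept = department_stream.withWatermark("depTimestamp", "10 seconds")  # the missing watermark

joined = dept.join(
    emp,
    expr("""id = departmentId AND
            empTimestamp >= depTimestamp AND
            empTimestamp <= depTimestamp + interval 30 seconds"""),
    "leftOuter")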

Feature engineering of rolling windows with apache beam

I have been able to read in the following data representing customer transactions as csv with Beam (Python SDK).
timestamp,customer_id,amount
2018-02-08 12:04:36.899422,1,45.92615814813004
2019-04-05 07:40:17.873746,1,47.360044568200514
2019-07-27 04:37:48.060949,1,23.325754816230106
2017-05-18 15:46:41.654809,2,25.47369262400646
2018-08-08 03:59:05.791552,2,34.859367944028875
2019-01-02 02:44:35.208450,2,5.2753275435507705
2020-03-06 09:45:29.866731,2,35.656304542140404
2020-05-28 20:19:08.593375,2,23.23715711587539
The csv is being read in as follows:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.textio import ReadFromText
import datetime


class Split(beam.DoFn):
    def process(self, element):
        timestamp, customer_id, amount = element.split(",")
        return [{
            'timestamp': timestamp,
            'customer': int(customer_id),
            'amount': float(amount)
        }]


options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    rows = (
        p |
        ReadFromText('../data/sample_trxns.csv', skip_header_lines=1) |
        beam.ParDo(Split())
    )


class UnixTime(beam.DoFn):
    def process(self, element):
        """
        Returns a list of dicts containing the unix timestamp, customer and amount
        """
        unix_time = datetime.datetime.strptime(
            element['timestamp'],
            "%Y-%m-%d %H:%M:%S.%f"
        ).timestamp()
        return [{
            'timestamp': unix_time,
            'customer': element['customer'],
            'amount': element['amount']
        }]


class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        unix_timestamp = element['timestamp']
        # Wrap and emit the current entry and new timestamp in a
        # TimestampedValue.
        yield beam.window.TimestampedValue(element, unix_timestamp)


timed_rows = (
    rows |
    beam.ParDo(UnixTime()) |
    beam.ParDo(AddTimestampDoFn())
)
However, with Beam I have been unable to derive rolling window features such as 'customer mean transaction value over the last 1000 days', and the equivalent rolling window features for min, max and sum (excluding the current row in each calculation). The following demonstrates the desired values of the feature, calculated with the pandas.Series.rolling function and printed as a pandas dataframe:
customer_id amount mean_trxn_amount_last_1000_days
timestamp
2018-02-08 12:04:36.899422 1 45.926158 NaN
2019-04-05 07:40:17.873746 1 47.360045 45.926158
2019-07-27 04:37:48.060949 1 23.325755 46.643101
2017-05-18 15:46:41.654809 2 25.473693 NaN
2018-08-08 03:59:05.791552 2 34.859368 25.473693
2019-01-02 02:44:35.208450 2 5.275328 30.166530
2020-03-06 09:45:29.866731 2 35.656305 20.067348
2020-05-28 20:19:08.593375 2 23.237157 25.263667
I have not found any documentation for similar functionality in Beam - is such functionality available? If not, am I misunderstanding the intended scope of what Beam is meant to provide, or is this sort of functionality likely to be available in the future? Thanks.
You can make use of windowing, since you have already extracted the timestamps in your sample code.
Fixed windows:
"The simplest form of windowing is using fixed time windows: given a timestamped PCollection which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a 30 second interval."
Sliding windows:
"A sliding time window also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture 60 seconds worth of data, but a new window starts every 30 seconds. The frequency with which sliding windows begin is called the period. Therefore, our example would have a window duration of 60 seconds and a period of 30 seconds."
Apply the window, then make use of either the built-in combiners for Min/Max/Sum etc., or create your own combiner.
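A minimal sketch of that idea as a continuation of the pipeline above, using sliding windows and a built-in combiner. The 1000-day window advancing one day at a time, keying by customer, and the use of Mean are assumptions chosen to mirror the pandas example; unlike the pandas rolling feature, each window here includes the current transaction and produces one value per (customer, window) rather than one per row:

import apache_beam as beam
from apache_beam import window

ONE_DAY = 24 * 60 * 60  # window sizes are given in seconds

windowed_means = (
    timed_rows |
    # key each timestamped element by customer, keep only the amount
    'KeyByCustomer' >> beam.Map(lambda row: (row['customer'], row['amount'])) |
    # 1000-day sliding windows that start every day
    'SlidingWindows' >> beam.WindowInto(
        window.SlidingWindows(size=1000 * ONE_DAY, period=ONE_DAY)) |
    # built-in combiner; Count/Top, or beam.CombinePerKey(min/max/sum), work the same way
    'MeanPerCustomer' >> beam.combiners.Mean.PerKey()
)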