Structured Streaming metrics understanding - scala

I'm quite new to Structured Streaming and would like to understand Spark's main streaming metrics in a bit more detail.
I have a Structured Streaming process in Databricks that reads events from one Event Hub, reads values from those events, creates a new DataFrame and writes that DataFrame to a second Event Hub.
The event coming from the first Event Hub is an Event Grid event (emitted when a blob is added to a storage account) from which I read a URL; inside a foreachBatch I create a new DataFrame from that file and write it to the second Event Hub.
The code has the following structure:
val streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select($"body".cast("string"))
def get_func(batchDF: DataFrame, batchID: Long): Unit = {
  batchDF.persist()
  for (row <- batchDF.rdd.collect) { // collect is necessary to read each file with spark.read...
    val file_url = "/mnt/" + path // path is taken from the Event Grid event in `row`
    // create a DataFrame from the file at that URL
    val df = spark
      .read
      .option("rowTag", "Transaction")
      .xml(file_url)
    if (!df.rdd.isEmpty) {
      // some filtering
      val eh_df = df.select(col(...).as(...), ...)
      val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
      // write to the second Event Hub
      eh_jsoned.select("body")
        .write
        .format("eventhubs")
        .options(eventHubsConfWrite.toMap)
        .save()
    }
  }
  batchDF.unpersist()
}
val query_test = streamingInputDF
  .writeStream
  .queryName("query_test")
  .foreachBatch(get_func _)
  .start()
I have tried setting maxEventsPerTrigger(100), but this greatly increases the time between the data arriving in the storage account and it being consumed in Databricks.
The value of 100 was chosen arbitrarily, just to test the behaviour.
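For reference, a minimal sketch of where this cap is applied (on the eventHubsConf already used by readStream above), assuming the azure-event-hubs-spark connector; connectionString is a placeholder:
import org.apache.spark.eventhubs.EventHubsConf

// connectionString is a placeholder; 100 is an arbitrary test value
val eventHubsConf = EventHubsConf(connectionString)
  .setMaxEventsPerTrigger(100)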
Looking at the metrics, why would the batch duration keep increasing so much while the processing rate and the input rate stay roughly equal?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 Notebook, Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
The XML files all have the same size
The first Event Hub has 20 partitions
The input rate to the first Event Hub is 2 events/sec

Related

Spark Structured Streaming dynamic lookup with Redis

I am new to Spark.
We are currently building a pipeline:
Read the events from a Kafka topic
Enrich this data with a Redis lookup
Write the events to a new Kafka topic
My problem is that when I use the spark-redis library it performs very well, but the data stays static in my streaming job.
Although the data is refreshed in Redis, the change is not reflected in my DataFrame.
Spark reads the data once at the start and never updates it.
I also read all the data from Redis up front, about 1 million key-value strings in total.
What approaches/methods can I use? I want to use Redis as an in-memory dynamic lookup,
and the lookup table changes roughly every hour.
Thanks.
used libraries:
spark-redis-2.4.1.jar
commons-pool2-2.0.jar
jedis-3.2.0.jar
Here is the code part:
import com.intertech.hortonworks.spark.registry.functions._

val config = Map[String, Object]("schema.registry.url" -> "http://aa.bbb.ccc.yyy:xxxx/api/v1")
implicit val srConfig: SchemaRegistryConfig = SchemaRegistryConfig(config)
var rawEventSchema = sparkSchema("my_raw_json_events")

val my_raw_events_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
  .option("subscribe", "my-raw-event")
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .load()
  .select(from_json($"value".cast("string"), rawEventSchema, Map.empty[String, String])
    .alias("C"))
import com.redislabs.provider.redis._

val sc = spark.sparkContext
val stringRdd = sc.fromRedisKV("PARAMETERS:*")
val lookup_map = stringRdd.collect().toMap
val lookup = udf((key: String) => lookup_map.getOrElse(key, ""))
val curated_df = my_raw_events_df
  .select(
    ...
    $"C.SystemEntryDate".alias("RecordCreateDate")
    , $"C.Profile".alias("ProfileCode")
    , lookup(expr("'PARAMETERS:PROFILE||'||NVL(C.Profile,'')")).alias("ProfileName")
    , $"C.IdentityType"
    , lookup(expr("'PARAMETERS:IdentityType||'||NVL(C.IdentityType,'')")).alias("IdentityTypeName")
    ...
  ).as("C")
import org.apache.spark.sql.streaming.Trigger

val query = curated_df
  .select(to_sr(struct($"*"), "curated_event_sch").alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
  .option("topic", "curated-event")
  .option("checkpointLocation", "/user/spark/checkPointLocation/xyz")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()
One option is to not use spark-redis, but rather look up values in Redis directly. This can be done with the df.mapPartitions function. You can find some examples for Spark DStreams here: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. The idea for Structured Streaming is similar, as sketched below. Be careful to handle the Redis connection properly.
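A minimal sketch of that idea (illustrative only: the case classes, the rawEvents Dataset and the Redis host/port are placeholders; Jedis is the client already listed in the question):
import redis.clients.jedis.Jedis
import spark.implicits._

case class RawEvent(profile: String, identityType: String)
case class EnrichedEvent(profile: String, profileName: String, identityType: String)

// rawEvents: Dataset[RawEvent] derived from the streaming DataFrame (placeholder)
val enriched = rawEvents.mapPartitions { rows =>
  val jedis = new Jedis("redis-host", 6379)   // one connection per partition
  val out = rows.map { e =>
    val name = Option(jedis.get("PARAMETERS:PROFILE||" + e.profile)).getOrElse("")
    EnrichedEvent(e.profile, name, e.identityType)
  }.toList                                    // materialize before closing the connection
  jedis.close()
  out.iterator
}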
Another solution is to do a stream-static join (see the Spark docs):
Instead of collecting the Redis RDD to the driver, use a Redis DataFrame (see the spark-redis docs) as a static DataFrame to be joined with your stream, so it will look like:
val redisStaticDf = spark.read. ...
val streamingDf = spark.readStream. ...
streamingDf.join(redisStaticDf, ...)
Since Spark's micro-batch execution engine re-evaluates the query on each trigger, the Redis DataFrame will fetch the data on each trigger, giving you up-to-date data (if you cache the DataFrame, it won't).
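A minimal sketch of that join, adapted to the key layout from the question (only the Profile lookup is shown): the static side is built from the same PARAMETERS:* pattern via the spark-redis RDD API, kept distributed instead of collected, and the stream is joined on a derived key column. Names are illustrative and spark.implicits._ is assumed to be in scope:
import org.apache.spark.sql.functions._
import com.redislabs.provider.redis._

// Static side: same key pattern, but no collect() to the driver.
// Do not cache it if you want it re-read on every trigger.
val redisStaticDf = spark.sparkContext
  .fromRedisKV("PARAMETERS:*")
  .toDF("lookupKey", "profileName")

// Streaming side: derive the lookup key, then join instead of calling a UDF.
val enrichedDf = my_raw_events_df
  .withColumn("profileKey",
    concat(lit("PARAMETERS:PROFILE||"), coalesce($"C.Profile", lit(""))))
  .join(redisStaticDf, $"profileKey" === $"lookupKey", "left")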

Termination of a Structured Streaming query using Databricks

I would like to understand whether running a cell with the code below in a Databricks notebook and then cancelling it means that the stream reading is over, or whether it requires some explicit closing.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .option("subscribe", "topic1")
  .load()

display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)])
Non-display Mode
It's best to issue this command in a cell:
streamingQuery.stop()
for this type of approach:
val streamingQuery = streamingDF                  // Start with our "streaming" DataFrame
  .writeStream                                    // Get the DataStreamWriter
  .queryName(myStreamName)                        // Name the query
  .trigger(Trigger.ProcessingTime("3 seconds"))   // Configure for a 3-second micro-batch
  .format("parquet")                              // Specify the sink type, a Parquet file
  .option("checkpointLocation", checkpointPath)   // Specify the location of checkpoint files & W-A logs
  .outputMode("append")                           // Write only new data to the "file"
  .start(outputPathDir)
Otherwise it continues to run - which is the whole idea of streaming.
I would not stop the cluster for this, as that takes down all streams at once.
Databricks display Mode
Databricks has written a nice set of utilities, but you need to do the course to get them. My folly.
display is a Databricks-specific function. It needs to be invoked like:
display(myDF, streamName = "myQuery")
then proceed as follows in a separate cell:
println("Looking for %s".format(myStreamName))
for (stream <- spark.streams.active) // Loop over all active streams
if (stream.name == myStreamName) // Single out your stream
{val s = spark.streams.get(stream.id)
s.stop()
}
This will stop the display query, which writes to a memory sink.
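If the goal is simply to stop every active stream in the notebook (a shortcut, not limited to the display sink), the same loop can be collapsed to:
spark.streams.active.foreach(_.stop())   // stops all active streaming queries in this SparkSession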

How to display results of intermediate transformations of streaming query?

I am implementing a use case to try out the Spark Structured Streaming API.
The source data is read from a Kafka topic and, after applying some transformations, the results are written to the console.
I want to print the intermediate output along with the final results of the structured streaming query.
Here is the code snippet:
val trips = getTaxiTripDataframe() // this function consumes the Kafka topic and deserializes the byte array to create a dataframe with the required columns

val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)

val output = filteredTrips
  .groupBy("taxiCompany", "pickupArea")
  .agg(Map("pickupArea" -> "count"))

val query = output.writeStream.format("console")
  .option("numRows", "50")
  .option("truncate", "false")
  .outputMode("update")
  .start()

query.awaitTermination()
I want to print the 'filteredTrips' dataframe to the console. I tried using the dataframe's .show() method, but since it is a dataframe built on streaming data, it throws the exception below:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Is there any other work around?
Yes, you can create two streams (I am using Spark 2.4.3)
val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)

val query1 = filteredTrips
  .writeStream
  .format("console")
  .option("numRows", "50")
  .option("truncate", "false")
  .outputMode("update")
  .start()

val query2 = filteredTrips
  .groupBy("taxiCompany", "pickupArea")
  .agg(Map("pickupArea" -> "count"))
  .writeStream
  .format("console")
  .option("numRows", "50")
  .option("truncate", "false")
  .outputMode("update")
  .start()

query1.awaitTermination()
query2.awaitTermination()
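Another option (a sketch, not part of the answer above) is to route filteredTrips through a foreachBatch sink, where each micro-batch arrives as a plain DataFrame and .show() is allowed:
import org.apache.spark.sql.DataFrame

def showBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.show(50, false) // inspect the intermediate result of each micro-batch
}

val debugQuery = filteredTrips
  .writeStream
  .foreachBatch(showBatch _)
  .start()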

How can we get mini-batch time from Structured Streaming

In Spark Streaming (DStreams), foreachRDD has a time parameter, which can be used for different purposes - updating metadata, creating an additional time column in the RDD, ...
val stream = KafkaUtils.createDirectStream(...)
stream.foreachRDD { (rdd, time) =>
  // update metadata with time
  // convert rdd to df and add time column
  // write df
}
In Structured Streaming, the API looks like this:
val df: Dataset[Row] = spark
  .readStream
  .format("kafka")
  .load()

df.writeStream.trigger(...)
  .outputMode(...)
  .start()
How can I get similar time (mini-batch time) information in Structured Streaming, so that I can use it in the same way?
I searched for a function that exposes the batch time, but it doesn't seem to exist yet in the Spark Structured Streaming APIs.
Here's a workaround I used to approximate the batch time (assuming a batch interval of 2000 milliseconds) using foreachBatch, which gives us the batchId:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val now = java.time.Instant.now
val batchInterval = 2000L // milliseconds

df.writeStream.trigger(Trigger.ProcessingTime(batchInterval))
  .foreachBatch({ (batchDF: DataFrame, batchId: Long) =>
    // estimate the trigger time of this batch from the start time and the batch id
    println(now.plusMillis(batchId * batchInterval))
  })
  .outputMode(...)
  .start()
Here's the output :
2019-07-29T17:13:19.880Z
2019-07-29T17:13:21.880Z
2019-07-29T17:13:23.880Z
2019-07-29T17:13:25.880Z
2019-07-29T17:13:27.880Z
2019-07-29T17:13:29.880Z
2019-07-29T17:13:31.880Z
2019-07-29T17:13:33.880Z
2019-07-29T17:13:35.880Z
I hope it helps!
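As a side note (a sketch, not part of the workaround above): once the query is running, each micro-batch's wall-clock trigger time is also reported in the query progress, so you can read it back instead of estimating it:
// query is the StreamingQuery returned by start(); lastProgress is null before the first batch
Option(query.lastProgress).foreach { p =>
  println(s"batch ${p.batchId} triggered at ${p.timestamp}") // timestamp is an ISO-8601 string
}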

Structured streaming with periodically updated static dataset

Merging streams with static datasets is a great feature of Structured Streaming. But on every batch the datasets are refreshed from the data sources. Since these sources are not always that dynamic, it would be a performance gain to cache a static dataset for a specified period of time (or a number of batches).
After the specified period/number of batches the dataset is reloaded from the source; otherwise it is served from the cache.
In Spark Streaming I managed this with a cached dataset that I unpersisted after a specified number of batch runs, but for some reason this no longer works with Structured Streaming.
Any suggestions on how to do this with Structured Streaming?
I developed a solution for another question, Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically, which might also help solve your problem:
You can do this by making use of the scheduling capabilities that Structured Streaming provides.
You can trigger the refresh (unpersist -> load -> persist) of a static Dataframe by creating an artificial "rate" stream that fires periodically. The idea is to:
Load the staticDataframe initially and keep as var
Define a method that refreshes the static Dataframe
Use a "Rate" Stream that gets triggered at the required interval (e.g. 1 hour)
Read actual streaming data and perform join operation with static Dataframe
Within that Rate Stream have a foreachBatch sink that calls refresher method
The following code runs fine with Spark 3.0.1, Scala 2.12.10 and Delta 0.7.0.
import java.util.Calendar
import org.apache.spark.sql.{Dataset, SaveMode}
import org.apache.spark.sql.streaming.Trigger

// 1. Load the staticDataframe initially and keep as `var`
var staticDf = spark.read.format("delta").load(deltaPath)
staticDf.persist()
// 2. Define a method that refreshes the static Dataframe
def foreachBatchMethod[T](batchDf: Dataset[T], batchId: Long): Unit = {
  staticDf.unpersist()
  staticDf = spark.read.format("delta").load(deltaPath)
  staticDf.persist()
  println(s"${Calendar.getInstance().getTime}: Refreshing static Dataframe from DeltaLake")
}
// 3. Use a "Rate" Stream that gets triggered at the required interval (e.g. 1 hour)
val staticRefreshStream = spark.readStream
.format("rate")
.option("rowsPerSecond", 1)
.option("numPartitions", 1)
.load()
.selectExpr("CAST(value as LONG) as trigger")
.as[Long]
// 4. Read the actual streaming data and perform the join with the static Dataframe
// As an example I used Kafka as the streaming source
val streamingDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(value AS STRING) as id", "offset as streamingField")

val joinDf = streamingDf.join(staticDf, "id")

val query = joinDf.writeStream
  .format("console")
  .option("truncate", false)
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .start()
// 5. Within that rate stream have a `foreachBatch` sink that calls the refresher method
staticRefreshStream.writeStream
  .outputMode("append")
  .foreachBatch(foreachBatchMethod[Long] _)
  .queryName("RefreshStream")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
To have a full example, the Delta table was created as follows:
val deltaPath = "file:///tmp/delta/table"

import spark.implicits._

val df = Seq(
  (1L, "static1"),
  (2L, "static2")
).toDF("id", "deltaField")

df.write
  .mode(SaveMode.Overwrite)
  .format("delta")
  .save(deltaPath)
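To see the refresh in action (just a usage sketch), append a new row to the Delta table while both streams are running; after the next refresh trigger the new id should show up in the join output:
// append a third row to the static table while the queries keep running
Seq((3L, "static3")).toDF("id", "deltaField")
  .write
  .mode(SaveMode.Append)
  .format("delta")
  .save(deltaPath)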