Structured streaming with periodically updated static dataset - scala

Merging streaming with static datasets is a great feature of Structured Streaming. But on every batch the datasets are refreshed from the data sources. Since these sources are not always that dynamic, it would be a performance gain to cache a static dataset for a specified period of time (or number of batches).
After the specified period or number of batches, the dataset is reloaded from the source; otherwise it is retrieved from the cache.
In Spark Streaming I managed this by caching the dataset and unpersisting it after a specified number of batch runs, but for some reason this no longer works with Structured Streaming.
Any suggestions to do this with structured streaming?

I developed a solution for another question, Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically, which might also help to solve your problem:
You could do this by making use of the streaming scheduling capabilities that Structured Streaming provides.
You can trigger the refreshing (unpersist -> load -> persist) of a static Dataframe by creating an artificial "Rate" stream that refreshes the static dataset periodically. The idea is to:
1. Load the static Dataframe initially and keep it as a var
2. Define a method that refreshes the static Dataframe
3. Use a "Rate" stream that gets triggered at the required interval (e.g. 1 hour)
4. Read the actual streaming data and perform the join operation with the static Dataframe
5. Within that Rate stream, have a foreachBatch sink that calls the refresher method created in step 2
The following code runs fine with Spark 3.0.1, Scala 2.12.10 and Delta 0.7.0.
import java.util.Calendar
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// 1. Load the staticDataframe initially and keep as `var`
var staticDf = spark.read.format("delta").load(deltaPath)
staticDf.persist()

// 2. Define a method that refreshes the static Dataframe
def foreachBatchMethod[T](batchDf: Dataset[T], batchId: Long): Unit = {
  staticDf.unpersist()
  staticDf = spark.read.format("delta").load(deltaPath)
  staticDf.persist()
  println(s"${Calendar.getInstance().getTime}: Refreshing static Dataframe from DeltaLake")
}
// 3. Use a "Rate" Stream that gets triggered at the required interval (e.g. 1 hour)
val staticRefreshStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .option("numPartitions", 1)
  .load()
  .selectExpr("CAST(value as LONG) as trigger")
  .as[Long]

// 4. Read actual streaming data and perform join operation with static Dataframe
// As an example I used Kafka as a streaming source
val streamingDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(value AS STRING) as id", "offset as streamingField")
val joinDf = streamingDf.join(staticDf, "id")
val query = joinDf.writeStream
  .format("console")
  .option("truncate", false)
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .start()

// 5. Within that Rate Stream have a `foreachBatch` sink that calls refresher method
staticRefreshStream.writeStream
  .outputMode("append")
  .foreachBatch(foreachBatchMethod[Long] _)
  .queryName("RefreshStream")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
For a complete example, the Delta table was created as follows:
import org.apache.spark.sql.SaveMode
import spark.implicits._

val deltaPath = "file:///tmp/delta/table"
val df = Seq(
  (1L, "static1"),
  (2L, "static2")
).toDF("id", "deltaField")
df.write
  .mode(SaveMode.Overwrite)
  .format("delta")
  .save(deltaPath)
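If you would rather refresh after a number of batches than after a period of time (the question mentions both), one possible approach is to let the refresher method decide based on the batchId that foreachBatch passes in. The following is only a minimal sketch in plain Scala: BatchCountRefresher and refreshEvery are hypothetical names that are not part of Spark, and the sketch assumes that batch ids increase by one per micro-batch.

```scala
// Hypothetical helper (not a Spark API): decides per micro-batch
// whether the static data is due for a reload.
final class BatchCountRefresher(refreshEvery: Long) {
  require(refreshEvery > 0, "refreshEvery must be positive")

  // batchId is the id that foreachBatch hands to its function.
  def shouldRefresh(batchId: Long): Boolean = batchId % refreshEvery == 0
}

object BatchCountRefresherDemo {
  def main(args: Array[String]): Unit = {
    val refresher = new BatchCountRefresher(refreshEvery = 3)
    // Simulate the batch ids of seven consecutive micro-batches.
    (0L to 6L).foreach { id =>
      println(s"batch $id -> refresh=${refresher.shouldRefresh(id)}")
    }
  }
}
```

Inside foreachBatchMethod you could then wrap the unpersist/load/persist lines in if (refresher.shouldRefresh(batchId)) { ... }. The actual refresh cadence then depends on the trigger interval of the Rate stream.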

Related

Read from Kafka topic process the data and write back to Kafka topic using scala and spark

Hi, I'm reading from a Kafka topic and I want to process the data received from Kafka (tokenization, filtering out unnecessary data, removing stop words) and finally write the result back to another Kafka topic.
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
val df = readStream.selectExpr("CAST(value AS STRING)" )
df.show(false)
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result
// write back to kafka
val writeStream = cleanedDataframe
.writeStream
.outputMode("append")
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("topic", "writing.val")
.start()
writeStream.awaitTermination()
Then I am getting the below error
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with
writeStream.start();;
Then I have edited my code as follows to read from kafka and write into console
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)" )
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
// then perform the data processing part as mentioned in the first half
With the second method, data was displayed continuously in the console, but the data processing part never ran. How can I read from a Kafka topic, perform some actions (tokenization, removing stop words) on the received data, and finally write the result back to a new Kafka topic?
EDIT
Stack Trace is pointing at df.show(false) in the above code during the error
There are two common problems in your current implementation:
1. Applying show in a streaming context
2. Code after awaitTermination will not be executed
To 1.
The method show is an action (as opposed to a transformation) on a dataframe. As you are dealing with streaming dataframes, this causes an error because streaming queries need to be started with start (just as the exception text is telling you).
To 2.
The method awaitTermination blocks the calling thread until the query terminates, which means that any code placed after it is not reached while the stream is running.
Overall Solution
If you want to read and write to Kafka and in-between want to understand what data is being processed by showing the data in the console you can do the following:
// read from kafka
val readStream = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
  .load()

// write to console
val df = readStream.selectExpr("CAST(value AS STRING)")
df.writeStream
  .outputMode("append")
  .format("console")
  .start()

val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result

// write back to kafka
// the columns `key` and `value` of the DataFrame `cleanedDataframe` will be used for producing the message into the Kafka topic.
val writeStreamKafka = cleanedDataframe
  .writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("topic", "writing.val")
  .start()
existingSparkSession.streams.awaitAnyTermination()
Note the existingSparkSession.streams.awaitAnyTermination() at the very end of the code, instead of calling awaitTermination directly after start. awaitAnyTermination is defined on the StreamingQueryManager returned by existingSparkSession.streams, not on the session itself. Also, remember that the columns key and value of the DataFrame cleanedDataframe will be used for producing the message into the Kafka topic. However, a key column is not required.
In addition, if you are using checkpointing (recommended), you need to set two different checkpoint locations: one for the console stream and one for the Kafka output stream. Keep in mind that the two streaming queries run independently.

Stream-Static Join: How to refresh (unpersist/persist) static Dataframe periodically

I am building a Spark Structured Streaming application where I am doing a batch-stream join. And the source for the batch data gets updated periodically.
So, I am planning to do a persist/unpersist of that batch data periodically.
Below is a sample code which I am using to persist and unpersist the batch data.
Flow:
Read the batch data
persist the batch data
Every hour, unpersist the data, read the batch data again and persist it.
But, I am not seeing the batch data getting refreshed for every hour.
Code:
var batchDF = handler.readBatchDF(sparkSession)
batchDF.persist(StorageLevel.MEMORY_AND_DISK)
var refreshedTime: Instant = Instant.now()
if (Duration.between(refreshedTime, Instant.now()).getSeconds > refreshTime) {
refreshedTime = Instant.now()
batchDF.unpersist(false)
batchDF = handler.readBatchDF(sparkSession)
.persist(StorageLevel.MEMORY_AND_DISK)
}
Is there any better way to achieve this scenario in spark structured streaming jobs ?
You could do this by making use of the streaming scheduling capabilities that Structured Streaming provides.
You can trigger the refreshing (unpersist -> load -> persist) of a static Dataframe by creating an artificial "Rate" stream that refreshes the static Dataframe periodically. The idea is to:
1. Load the static Dataframe initially and keep as var
2. Define a method that refreshes the static Dataframe
3. Use a "Rate" Stream that gets triggered at the required interval (e.g. 1 hour)
4. Read actual streaming data and perform join operation with static Dataframe
5. Within that Rate Stream have a foreachBatch sink that calls refresher method created in step 2.
The following code runs fine with Spark 3.0.1, Scala 2.12.10 and Delta 0.7.0.
import java.util.Calendar
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// 1. Load the staticDataframe initially and keep as `var`
var staticDf = spark.read.format("delta").load(deltaPath)
staticDf.persist()

// 2. Define a method that refreshes the static Dataframe
def foreachBatchMethod[T](batchDf: Dataset[T], batchId: Long): Unit = {
  staticDf.unpersist()
  staticDf = spark.read.format("delta").load(deltaPath)
  staticDf.persist()
  println(s"${Calendar.getInstance().getTime}: Refreshing static Dataframe from DeltaLake")
}
// 3. Use a "Rate" Stream that gets triggered at the required interval (e.g. 1 hour)
val staticRefreshStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .option("numPartitions", 1)
  .load()
  .selectExpr("CAST(value as LONG) as trigger")
  .as[Long]

// 4. Read actual streaming data and perform join operation with static Dataframe
// As an example I used Kafka as a streaming source
val streamingDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(value AS STRING) as id", "offset as streamingField")
val joinDf = streamingDf.join(staticDf, "id")
val query = joinDf.writeStream
  .format("console")
  .option("truncate", false)
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .start()

// 5. Within that Rate Stream have a `foreachBatch` sink that calls refresher method
staticRefreshStream.writeStream
  .outputMode("append")
  .foreachBatch(foreachBatchMethod[Long] _)
  .queryName("RefreshStream")
  .trigger(Trigger.ProcessingTime("5 seconds")) // or e.g. 1 hour
  .start()
For a complete example, the Delta table was created and updated with new values as follows:
import org.apache.spark.sql.SaveMode
import spark.implicits._

val deltaPath = "file:///tmp/delta/table"
val df = Seq(
  (1L, "static1"),
  (2L, "static2")
).toDF("id", "deltaField")
df.write
  .mode(SaveMode.Overwrite)
  .format("delta")
  .save(deltaPath)
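A likely reason why the snippet in the question never refreshes is that the if check appears to run only once, while the driver sets up the job, and is not evaluated again afterwards. The time check has to live in code that runs repeatedly (for example inside foreachBatch). As a minimal sketch of such a check in plain Scala, where TimedCache, ttl, load and clock are all hypothetical names introduced here and not part of Spark:

```scala
import java.time.{Duration, Instant}

// Hypothetical helper: holds a value and reloads it once `ttl` has elapsed.
// `clock` is injectable so the behaviour can be demonstrated without waiting.
final class TimedCache[T](
    ttl: Duration,
    load: () => T,
    clock: () => Instant = () => Instant.now()) {

  private var loadedAt: Instant = clock()
  private var value: T = load()

  // Must be called repeatedly (e.g. once per micro-batch) for the refresh to happen.
  def get(): T = synchronized {
    val now = clock()
    if (Duration.between(loadedAt, now).compareTo(ttl) >= 0) {
      value = load()
      loadedAt = now
    }
    value
  }
}

object TimedCacheDemo {
  def main(args: Array[String]): Unit = {
    var fakeNow = Instant.parse("2021-01-01T00:00:00Z")
    var loads = 0
    val cache = new TimedCache[Int](
      Duration.ofHours(1),
      () => { loads += 1; loads }, // stands in for reading the batch data
      () => fakeNow)

    println(cache.get())                           // value from the initial load
    fakeNow = fakeNow.plus(Duration.ofMinutes(61)) // pretend an hour has passed
    println(cache.get())                           // ttl elapsed, loader ran again
  }
}
```

This only covers the "is it time yet" decision. With a persisted Dataframe, the loader would still have to do the unpersist/load/persist sequence shown in step 2.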

Spark Structured Streaming dynamic lookup with Redis

I am new to Spark.
We are currently building a pipeline that will:
1. Read the events from a Kafka topic
2. Enrich this data with the help of a Redis lookup
3. Write the events to a new Kafka topic
My problem is that the spark-redis library performs very well, but the data stays static in my streaming job.
Although the data is refreshed in Redis, the change is not reflected in my dataframe.
Spark reads the data once at the start and then never updates it.
I read all the data from Redis up front, about 1 million key-value strings in total.
What approaches can I use? I want to use Redis as an in-memory dynamic lookup.
The lookup table changes roughly every hour.
Thanks.
used libraries:
spark-redis-2.4.1.jar
commons-pool2-2.0.jar
jedis-3.2.0.jar
Here is the code part:
import com.intertech.hortonworks.spark.registry.functions._
val config = Map[String, Object]("schema.registry.url" -> "http://aa.bbb.ccc.yyy:xxxx/api/v1")
implicit val srConfig:SchemaRegistryConfig = SchemaRegistryConfig(config)
var rawEventSchema = sparkSchema("my_raw_json_events")
val my_raw_events_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
.option("subscribe", "my-raw-event")
.option("failOnDataLoss","false")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger",1000)
.load()
.select(from_json($"value".cast("string"),rawEventSchema, Map.empty[String, String])
.alias("C"))
import com.redislabs.provider.redis._
val sc = spark.sparkContext
val stringRdd = sc.fromRedisKV("PARAMETERS:*")
val lookup_map = stringRdd.collect().toMap
val lookup = udf((key: String) => lookup_map.getOrElse(key,"") )
val curated_df = my_raw_events_df
.select(
...
$"C.SystemEntryDate".alias("RecordCreateDate")
,$"C.Profile".alias("ProfileCode")
,lookup(expr("'PARAMETERS:PROFILE||'||NVL(C.Profile,'')")).alias("ProfileName")
,$"C.IdentityType"
,lookup(expr("'PARAMETERS:IdentityType||'||NVL(C.IdentityType,'')")).alias("IdentityTypeName")
...
).as("C")
import org.apache.spark.sql.streaming.Trigger
val query = curated_df
.select(to_sr(struct($"*"), "curated_event_sch").alias("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "aa.bbb.ccc.yyy:9092")
.option("topic", "curated-event")
.option("checkpointLocation","/user/spark/checkPointLocation/xyz")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start()
query.awaitTermination()
One option is to not use spark-redis, but rather look up the values in Redis directly. This can be achieved with the df.mapPartitions function. You can find some examples for Spark DStreams here https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. The idea for Structured Streaming is similar. Be careful to handle the Redis connection properly.
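To illustrate what handling the connection properly can look like, here is a minimal sketch of the per-partition part in plain Scala. It opens one connection per partition and closes it only after all keys have been looked up. KeyValueClient, PartitionLookup and FakeClient are hypothetical names made up for this sketch; a real job would put an actual Redis client behind the trait.

```scala
// Hypothetical minimal client interface, not part of any library.
trait KeyValueClient extends AutoCloseable {
  def get(key: String): Option[String]
}

object PartitionLookup {
  // Enriches all keys of one partition using a single connection.
  def enrichPartition(
      keys: Iterator[String],
      connect: () => KeyValueClient): Iterator[(String, String)] = {
    val client = connect()
    try {
      // Materialise before closing: Iterator.map is lazy, so without toVector
      // the lookups would only run after the connection is already closed.
      keys.map(key => key -> client.get(key).getOrElse("")).toVector.iterator
    } finally {
      client.close()
    }
  }
}

// In-memory stand-in for Redis so the sketch runs without a server.
final class FakeClient(data: Map[String, String]) extends KeyValueClient {
  var closed = false
  override def get(key: String): Option[String] = data.get(key)
  override def close(): Unit = { closed = true }
}

object PartitionLookupDemo {
  def main(args: Array[String]): Unit = {
    val client = new FakeClient(Map("PARAMETERS:PROFILE||A" -> "Profile A"))
    val enriched = PartitionLookup.enrichPartition(
      Iterator("PARAMETERS:PROFILE||A", "PARAMETERS:PROFILE||B"),
      () => client)
    enriched.foreach(println)
    println(s"connection closed: ${client.closed}")
  }
}
```

In a streaming job, a function like enrichPartition would be passed to mapPartitions on a Dataset of keys, with connect opening the real connection on the executor. Materialising a whole partition costs memory, so this assumes partitions of moderate size, which a setting like maxOffsetsPerTrigger helps to keep bounded.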
Another solution is to do a stream-static join (spark docs):
Instead of collecting the redis rdd to the driver, use the redis dataframe (spark-redis docs) as a static dataframe to be joined with your stream, so it will be like:
val redisStaticDf = spark.read. ...
val streamingDf = spark.readStream. ...
streamingDf.join(redisStaticDf, ...)
Since Spark's micro-batch execution engine evaluates the query on each trigger, the Redis dataframe fetches the data on each trigger and so gives you up-to-date data (if you cache the dataframe, it won't).

Reading kafka topic using spark dataframe

I want to create a dataframe on top of a Kafka topic and then register that dataframe as a temp table to perform a minus operation on the data. I have written the code below, but while querying the registered table I'm getting this error:
"org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
org.apache.spark.sql.types.DataType
org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "SERVER ******").option("subscribe", "TOPIC_NAME").option("startingOffsets", "earliest").load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema =StructType(Array(StructField("OEM",StringType,true),StructField("IMEI",StringType,true),StructField("CUSTOMER_ID",StringType,true),StructField("REQUEST_SOURCE",StringType,true),StructField("REQUESTER",StringType,true),StructField("REQUEST_TIMESTAMP",StringType,true),StructField("REASON_CODE",StringType,true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
ERROR:---------- "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
Also i have used start() method and I'm getting below error.
20/08/11 00:59:30 ERROR streaming.MicroBatchExecution: Query final_df1 [id = 1a3e2ea4-2ec1-42f8-a5eb-8a12ce0fb3f5, runId = 7059f3d2-21ec-43c4-b55a-8c735272bf0f] terminated with error
java.lang.AbstractMethodError
NOTE: My main objective behind this script is to write a minus query on this data and compare it with one of the registered tables I have on the cluster. To summarise: if I'm sending 1000 records to the Kafka topic from an Oracle database, I create a dataframe on top of the Oracle table and register it as a temp table, and I do the same with the Kafka topic. Then I want to run a minus query between source (Oracle) and target (Kafka topic) to perform 100% data validation between source and target. (Is registering a Kafka topic as a temporary table possible?)
Use memory sink instead of registerTempTable. Check below code.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "SERVER ******")
  .option("subscribe", "TOPIC_NAME")
  .option("startingOffsets", "earliest")
  .load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema = StructType(Array(
  StructField("OEM", StringType, true),
  StructField("IMEI", StringType, true),
  StructField("CUSTOMER_ID", StringType, true),
  StructField("REQUEST_SOURCE", StringType, true),
  StructField("REQUESTER", StringType, true),
  StructField("REQUEST_TIMESTAMP", StringType, true),
  StructField("REASON_CODE", StringType, true)
))
val personDF = personStringDF.select(from_json(col("value"), user_schema).as("data")).select("data.*")
// Write the stream into an in-memory table named after the query
personDF
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("final_df1")
  .start()
// Query the in-memory table like any other registered table
spark.sql("select * from final_df1").show(10, false)
Streaming DataFrame doesn't support the show() method. When you call start() method, it will start a background thread to stream the input data to the sink, and since you are using ConsoleSink, it will output the data to the console. You don't need to call show().
remove the below lines,
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
and add the below or equivalent lines instead,
val query1 = personDF.writeStream.queryName("final_df1").format("memory").outputMode("append").start()
query1.awaitTermination()

Termination of Structured Streaming queue using Databricks

I would like to understand whether running a cell in a Databricks notebook with the code below and then cancelling it means that the stream reading is over. Or perhaps it does require some explicit closing?
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServers)
.option("subscribe", "topic1")
.load()
display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)])
Non-display Mode
It's best to issue this command in a cell:
streamingQuery.stop()
for this type of approach:
import org.apache.spark.sql.streaming.Trigger

val streamingQuery = streamingDF                 // Start with our "streaming" DataFrame
  .writeStream                                   // Get the DataStreamWriter
  .queryName(myStreamName)                       // Name the query
  .trigger(Trigger.ProcessingTime("3 seconds"))  // Configure for a 3-second micro-batch
  .format("parquet")                             // Specify the sink type, a Parquet file
  .option("checkpointLocation", checkpointPath)  // Specify the location of checkpoint files & write-ahead logs
  .outputMode("append")                          // Write only new data to the "file"
  .start(outputPathDir)
Otherwise it continues to run, which is the whole idea of streaming.
I would not stop the cluster, as that would stop all streams running on it.
Databricks display Mode
Databricks has written a nice set of utilities, but you need to do the course to get them. My folly.
display is a Databricks-specific function. It needs a call like:
display(myDF, streamName = "myQuery")
then proceed as follows in a separate cell:
println("Looking for %s".format(myStreamName))
for (stream <- spark.streams.active) {     // Loop over all active streams
  if (stream.name == myStreamName) {       // Single out your stream
    val s = spark.streams.get(stream.id)
    s.stop()
  }
}
This stops the stream started by display, which writes to a memory sink.