How to access the data from a streaming query in a "memory" table for subsequent batch queries? - scala

Given a writeStream call:
val outDf = (sdf.writeStream
  .outputMode(outputMode)
  .format("memory")
  .queryName("MyInMemoryTable")
  .trigger(Trigger.ProcessingTime(interval))
  .start())
How can I run a SQL query against MyInMemoryTable, e.g.
val df = spark.sql("""select Origin,Dest,Carrier,avg(DepDelay) avgDepDelay
from MyInMemoryTable group by 1,2,3""")
The documentation for Spark Structured Streaming says that batch and streaming queries can be intermixed, but the above is not working:
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
So how can MyInMemoryTable be used in subsequent queries?

The following post on the Hortonworks site has an approach that seems promising: https://community.hortonworks.com/questions/181979/spark-structured-streaming-formatmemory-is-showing.html
Here is the sample writeStream from that post - it has the same form as in my original question:
StreamingQuery initDF = df.writeStream()
    .outputMode("append")
    .format("memory")
    .queryName("initDF")
    .trigger(Trigger.ProcessingTime(1000))
    .start();
sparkSession.sql("select * from initDF").show();
initDF.awaitTermination();
And here is the response:
Okay, the way it works is: in simple terms, the main thread of your code launches another thread in which your streaming-query logic runs.
Meanwhile, your main code is blocking due to initDF.awaitTermination().
sparkSession.sql("select * from initDF").show() => this code runs on the main thread, and it only gets there the first time.
So update your code to:
StreamingQuery initDF = df.writeStream()
    .outputMode("append")
    .format("memory")
    .queryName("initDF")
    .trigger(Trigger.ProcessingTime(1000))
    .start();

while (initDF.isActive()) {
    Thread.sleep(10000);
    sparkSession.sql("select * from initDF").show();
}
Now the main thread of your code keeps going through the loop, querying the table on each pass.
Applying the suggestion to my code results in:
while (outDf.isActive) {
  Thread.sleep(30000)
  strmSql(s"select * from $table", doCnt = false, show = true, nRows = 200)
}
outDf.awaitTermination(1 * 20000)
Update: this worked great. I am seeing updated results after each mini-batch.
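For reference, here is a minimal self-contained sketch of that working pattern, assuming spark is the active SparkSession, sdf is the streaming DataFrame, and interval is a trigger interval string such as "10 seconds":

import org.apache.spark.sql.streaming.Trigger

val query = sdf.writeStream
  .outputMode("append")
  .format("memory")               // results are kept in an in-memory table on the driver
  .queryName("MyInMemoryTable")   // table name used by spark.sql below
  .trigger(Trigger.ProcessingTime(interval))
  .start()

// Poll the in-memory table from the main thread while the stream is active.
while (query.isActive) {
  Thread.sleep(30000)
  spark.sql(
    """select Origin, Dest, Carrier, avg(DepDelay) as avgDepDelay
      |from MyInMemoryTable
      |group by Origin, Dest, Carrier""".stripMargin)
    .show(200, false)
}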

Related

pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame

I have a Glue streaming job, and I need to write the data as a stream, but only after applying some processing, so I did the following:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)
and in processBatch I do some processing, and at the end of it I do the following:
df.writeStream.format("hudi").options(**combinedConf).outputMode('append').start()
I am getting the following error:
pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame
As far as I understand, the df I am trying to write is not streaming, which is why it gives this error. I am not sure how I can change that from the Glue context, or how I can apply the processing on the streaming data and then writeStream it.
Any idea, please?
The forEachBatch method processes a streaming Dataset/DataFrame as a sequence of batches, so when we call it on data_frame_DataSource0, the df passed to processBatch is a normal (non-streaming) Dataset/DataFrame containing one batch of data.
You have two options to fix this:
Deal with the df as a normal DataFrame:
df.write.format("hudi").options(**combinedConf).mode("append").save()
Apply your stream processing directly on data_frame_DataSource0:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)
(
    data_frame_DataSource0.writeStream.format("hudi")
    .options(**combinedConf)
    .option("inferSchema", "true")
    .option("startingPosition", starting_position_of_kinesis_iterator)
    .outputMode('append')
    .start()
)
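For completeness, the same point holds for Spark's native foreachBatch outside of Glue: the DataFrame handed to the batch function is a plain one, so it is written with write rather than writeStream. A minimal Scala sketch, where streamDF, hudiOptions (a Map[String, String] of sink options) and the checkpoint path are all placeholders of mine, not names from the question:

import org.apache.spark.sql.DataFrame

// `streamDF` stands in for any streaming DataFrame; `hudiOptions` for your sink options.
streamDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Inside foreachBatch the DataFrame is a plain (non-streaming) one,
    // so the batch writer API is used here, not writeStream.
    batchDF.write
      .format("hudi")
      .options(hudiOptions)
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/hudi-sink") // assumed location
  .start()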

Joining streaming with cached static dataset in Spark Structured Streaming not working

I am trying to implement a solution in Spark Structured Streaming that refreshes a static dataset every 5 minutes and joins it with a streaming dataset that runs every 10 seconds.
I have tried to follow the solution marked on Structured streaming with periodically updated static dataset.
This is my code:
Dataset<Row> kafkaDS = sparkSession.readStream().format("kafka")
    .load()
    .select(from_avro(col("value"), abrisConfig).as("event"))
    .select("event.*");
AtomicReference<Dataset<Row>> cachedDS = new AtomicReference<>();
cachedDS.set(ss.read().format("jdbc")
    .options(<options here>)
    .load()
    .repartition(<repartition columns>)
    .sortWithinPartitions(<repartition columns>)
    .persist());
VoidFunction2<Dataset<Long>, Long> refresh = new VoidFunction2<Dataset<Long>, Long>() {
    private static final long serialVersionUID = -6106370300512290330L;
    @Override
    public void call(Dataset<Long> v1, Long v2) throws Exception {
        System.out.println("Refreshing cache");
        cachedDS.get().unpersist();
        cachedDS.set(ss.read().format("jdbc")
            .options(<options here>)
            .load()
            .repartition(<repartition columns>)
            .sortWithinPartitions(<repartition columns>)
            .persist());
    }
};
Dataset<Long> staticRefreshStream = ss.readStream()
    .format("rate")
    .option("rowsPerSecond", 1)
    .option("numPartitions", 1)
    .load()
    .selectExpr("CAST(value as LONG) as trigger")
    .as(Encoders.LONG());
// Here we do the join
Dataset<TcrDenegadasFiltroValue> leftJoinDS = cachedDS.get().join(kafkaDS) // details of the join omitted
staticRefreshStream.writeStream()
    .outputMode("append")
    .foreachBatch(refresh)
    .queryName("refreshCache")
    .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
    .start();
StreamingQuery ds = leftJoinDS
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream()
    .format("kafka")
    .options(<kafka options>)
    .start();
The code above runs with Spark 2.4.5, in a Kubernetes environment with 3 executor pods and a driver pod.
The problem is that when I start the application, my main stream (the one running every 10 seconds) does the join with the cached dataset, but it seems not to know that the static dataset is cached: it does not read it from the cache and takes a while to do the job.
Can you help with it?
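For readers following along in Scala, here is a hedged transliteration of the refresh pattern the question implements (spark, jdbcOptions and the rate-stream settings are placeholders of mine); note that it mirrors the question's approach rather than resolving the caching issue:

import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Placeholder loader; jdbcOptions stands in for the real connection options.
def loadStatic(): DataFrame =
  spark.read.format("jdbc").options(jdbcOptions).load().persist()

val cachedDS = new AtomicReference[DataFrame](loadStatic())

// A one-row-per-second rate stream whose only job is to trigger the refresh.
spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1")
  .load()
  .writeStream
  .outputMode("append")
  .foreachBatch { (_: DataFrame, _: Long) =>
    // Drop the old cached copy and re-read the static dataset.
    cachedDS.get().unpersist()
    cachedDS.set(loadStatic())
  }
  .queryName("refreshCache")
  .trigger(Trigger.ProcessingTime(5, TimeUnit.MINUTES))
  .start()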

Spark Streaming store method only works in the Duration window but not in the foreachRDD workflow in a customized receiver

I defined a receiver to read data from Redis.
Part of the receiver's code, simplified:
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) { // element type assumed for this simplified snippet
  override def onStart() = {
    while (!isStopped) {
      val res = readMethod()
      if (res != null) store(res.toIterator)
      // using res.foreach(r => store(r)) the performance is almost the same
    }
  }
  override def onStop() = {}
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD { r =>
  r.persist()
  if (!r.isEmpty) {
    // some short operations, about 1 s in total
    // note this line ######1
  }
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis now; I tested with 10000 records. I debugged, and all records could quickly be read by readMethod() above once they were in Redis. However, in each micro-batch I can only get about 30 records. (If store were fast enough it should get all 10000.)
With this suspicion, I added a 10-second sleep, Thread.sleep(10000), at ######1 above. Each micro-batch still gets about 30 records, and each micro-batch's processing time increases by 10 seconds. And if I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
All of this suggests Spark Streaming only generates the RDD during the batch Duration. Once it has the RDD and the main workflow is running, is the store method temporarily stopped? That would be a great waste if true; I want it to keep generating RDDs (store) while the main workflow is running.
Any ideas?
I cannot leave a comment simply because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
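If it helps to check: a minimal sketch, assuming the StreamingContext from the question is in scope as ssc, of how to see whether that property is set and where such a limit is typically configured (the value 100 below is only an example, not a recommendation):

// Quick check on the running context:
println(ssc.sparkContext.getConf.getOption("spark.streaming.receiver.maxRate"))

// If it is set, it came either from spark-submit (--conf) or from the SparkConf, e.g.:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "100") // at most 100 records/second per receiver (example value)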

Why memory sink writes nothing in append mode?

I used Spark Structured Streaming to stream messages from Kafka. The data was then aggregated and written to a memory sink in append mode. However, when I tried to query the memory table, it returned nothing. Below is the code:
result = model
    .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
    .withWatermark("timeStamp", "5 minutes")
    .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
    .agg(
        count("*").alias("total")
    );
// writing to memory
StreamingQuery query = result.writeStream()
    .outputMode(OutputMode.Append())
    .queryName("datatable")
    .format("memory")
    .start();
// query data in memory
new Timer().scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        sparkSession.sql("SELECT * FROM datatable").show();
    }
}, 10000, 10000);
The result is always:
+------+-----+
|window|total|
+------+-----+
+------+-----+
If I use outputMode = complete, then I can get the aggregated data. But that's not an option for me, as the requirement is to use append mode.
Is there any problem with the code?
Thanks,
In append mode,
The output of a windowed aggregation is delayed the late threshold specified in withWatermark()
In your case the delay is 5 minutes. I know nothing about your input data, but I guess you probably need to wait for 5 minutes.
I suggest you read (again?) the docs for Structured Streaming.
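To make the timing concrete, here is a hedged Scala sketch of the same aggregation (events is an assumed name for the streaming input read from Kafka); the comments spell out when append mode actually emits a row:

import org.apache.spark.sql.functions.{col, count, window}

val result = events // `events` stands in for the streaming DataFrame from Kafka
  .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
  .withWatermark("timeStamp", "5 minutes")
  .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
  .agg(count("*").alias("total"))

// In append mode a window's row is emitted exactly once, and only after the
// watermark (max event time seen so far minus 5 minutes) moves past the end
// of that window. Until enough event time has arrived, the memory table is
// legitimately empty, which matches the empty result shown above.
result.writeStream
  .outputMode("append")
  .queryName("datatable")
  .format("memory")
  .start()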

Monitoring Structured Streaming

I have a structured stream set up that is running just fine, but I was hoping to monitor it while it is running.
I have built an EventCollector:
class EventCollector extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    println("Start")
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(event.queryStatus.prettyJson)
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    println("Term")
  }
}
Then I added the listener to my Spark session:
val listener = new EventCollector()
spark.streams.addListener(listener)
Then I fire off the query:
val query = inputDF.writeStream
  //.format("console")
  .queryName("Stream")
  .foreach(writer)
  .start()
query.awaitTermination()
However, onQueryProgress never gets hit. onQueryStarted does, but I was hoping to get the progress of the query at a certain interval to monitor how the queries are doing. Can anyone assist with this?
After much research into this topic, this is what I have found...
OnQueryProgress gets hit in between queries. I am not sure if this is intentional functionality or not, but while we are streaming data from a file, OnQueryProgress does not fire.
A solution I found was to rely on the foreach writer sink and perform my own performance analysis within the process function. Unfortunately, we cannot access specific information about the query that is running (or I have not figured out how to yet). This is what I have implemented in my sandbox to analyze performance:
val writer = new ForeachWriter[rawDataRow] {
  // simple throughput counters (declared here so the snippet is self-contained)
  var counter = 0L
  var startTime = System.nanoTime()

  def open(partitionId: Long, version: Long): Boolean = {
    // We end up here in between files
    true
  }

  def process(value: rawDataRow) = {
    counter += 1
    if (counter % 1000 == 0) {
      val currentTime = System.nanoTime()
      val elapsedTime = (currentTime - startTime) / 1000000000.0
      println(s"Records Written: $counter")
      println(s"Time Elapsed: $elapsedTime seconds")
    }
  }

  def close(errorOrNull: Throwable): Unit = {}
}
An alternative way to get metrics:
Another way to get information about the running queries is to hit the GET endpoints that Spark provides:
http://localhost:4040/metrics
or
http://localhost:4040/api/v1/
Documentation here: http://spark.apache.org/docs/latest/monitoring.html
Update Number 2, Sept 2017:
Tested on regular Spark Streaming, not Structured Streaming.
Disclaimer: this may not apply to Structured Streaming; I need to set up a test bed to confirm. However, it does work with regular Spark Streaming (consuming from Kafka in this example).
I believe that since Spark 2.2 was released, new endpoints exist that can retrieve more metrics on the performance of the stream. This may have existed in previous versions and I just missed it, but I wanted to make sure it is documented for anyone else searching for this information.
http://localhost:4040/api/v1/applications/{applicationIdHere}/streaming/statistics
This endpoint looks like it was added in 2.2 (or it already existed and was just added to the documentation; I'm not sure, I haven't checked).
Anyway, it returns metrics in this format for the specified streaming application:
{
  "startTime" : "2017-09-13T14:02:28.883GMT",
  "batchDuration" : 1000,
  "numReceivers" : 0,
  "numActiveReceivers" : 0,
  "numInactiveReceivers" : 0,
  "numTotalCompletedBatches" : 90379,
  "numRetainedCompletedBatches" : 1000,
  "numActiveBatches" : 0,
  "numProcessedRecords" : 39652167,
  "numReceivedRecords" : 39652167,
  "avgInputRate" : 771.722,
  "avgSchedulingDelay" : 2,
  "avgProcessingTime" : 85,
  "avgTotalDelay" : 87
}
This gives us the ability to build our own custom metric/monitoring applications using the REST endpoints that are exposed by Spark.
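For example, a minimal Scala sketch that pulls that statistics JSON for the current application, assuming the UI is reachable on the default port 4040 and spark is the active session:

import scala.io.Source

// The application id can be taken from the running SparkSession.
val appId = spark.sparkContext.applicationId
val url = s"http://localhost:4040/api/v1/applications/$appId/streaming/statistics"

// Pull the JSON shown above and hand it to whatever monitoring system you use.
val statsJson = Source.fromURL(url).mkString
println(statsJson)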