pyspark foreachBatch reading same data again after restarting the stream - pyspark

df = spark.readStream.option("readChangeFeed", "true").option("startingVersion", 2).load(tablePath)
def foreach_batch_function(df, epoch_id):
print("epoch_id: ", epoch_id)
df.write.mode("append").json("/mnt/sample/data/test/")
df.writeStream.foreachBatch(foreach_batch_function).start()
When I terminate the writestream and run again foreachBatch processes same data again. How can we maintain checkpoints avoid this situation of reading old data?

This answers my question.
https://kb.databricks.com/streaming/checkpoint-no-cleanup-foreachbatch.html
You should manually specify the checkpoint directory with the checkpointLocation option.
streamingDF.writeStream.option("checkpointLocation","<checkpoint-path>").outputMode("append").foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write.format("parquet").mode("overwrite").save(output_directory)
}.start()

Related

Structured Streaming metrics understanding

I'm quite new to Structured Streaming and would like to understand a bit more in detail the main metrics of Spark.
I have a Structured Streaming process in Databricks that reads events from one Eventhub, read values from those events, creates a new df and writes this new df into a second Eventhub.
The event that comes from the first Eventhub, is an eventgrid event from which I read a url (when a blob is added to a storage account) and inside a foreachBatch, I create a new DF and write it to the second Eventhub.
The code has the following structure:
val streamingInputDF =
spark.readStream
.format("eventhubs")
.options(eventHubsConf.toMap)
.load()
.select(($"body").cast("string"))
def get_func( batchDF:DataFrame, batchID:Long ) : Unit = {
batchDF.persist()
for (row <- batchDF.rdd.collect) { //necessary to read the file with spark.read....
val file_url = "/mnt/" + path
// create df from readed url
val df = spark
.read
.option("rowTag", "Transaction")
.xml(file_url)
if (!(df.rdd.isEmpty)){
// some filtering
val eh_df = df.select(col(...).as(...),
val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
// write to Eventhub
eh_jsoned.select("body")
.write
.format("eventhubs")
.options(eventHubsConfWrite.toMap)
.save()
}
}
batchDF.unpersist()
}
val query_test= streamingSelectDF
.writeStream
.queryName("query_test")
.foreachBatch(get_func _)
.start()
I have tried adding the maxEventsPerTrigger(100) parameter but this increases a lot the time from when the data arrives to the Storage Account until it is consumed in Databricks.
The value for maxEventsPerTrigger is set randomly in order to test behaviour.
Having seen the metrics, what sense does it make that the batch time is increasing so much and the processing rate and input rate are similar?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 Notebook, Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
XML files have the same size
First Eventhub has 20 partitions
Rate data input to first Eventhub is 2 events/sec

How can we get mini-batch time from Structured Streaming

In the Spark streaming, there is forEachRDD with time parameter, where it is possible to take that time and use it for different purposes - metadata, create additional time column in rdd, ...
val stream = KafkaUtils.createDirectStream(...)
stream.foreachRDD { (rdd, time) =>
// update metadata with time
// convert rdd to df and add time column
// write df
}
In Structured Streaming the API
val df: Dataset[Row] = spark
.readStream
.format("kafka")
.load()
df.writeStream.trigger(...)
.outputMode(...)
.start()
How is that possible to get similar time (mini-batch time) data for structured streaming to be able to use it in the same way?
I have searched for a function which offers the possibility to get the batchTime but it doesn't seem to exist yet in the Spark Structured Streaming APIs.
Here's a workaround I used to get the batch time (Let's suppose that the batch interval is 2000 milliseconds) using the foreachBatchwhich allow us to get the batchId :
val now = java.time.Instant.now
val batchInterval = 2000
df.writeStream.trigger(Trigger.ProcessingTime(batchInterval))
.foreachBatch({ (batchDF: DataFrame, batchId: Long) =>
println(now.plusMillis(batchId * batchInterval.milliseconds))
})
.outputMode(...)
.start()
Here's the output :
2019-07-29T17:13:19.880Z
2019-07-29T17:13:21.880Z
2019-07-29T17:13:23.880Z
2019-07-29T17:13:25.880Z
2019-07-29T17:13:27.880Z
2019-07-29T17:13:29.880Z
2019-07-29T17:13:31.880Z
2019-07-29T17:13:33.880Z
2019-07-29T17:13:35.880Z
I hope it helps !

Spark structured streaming - UNION two or more streaming sources

I'm using spark 2.3.2 and running into an issue doing a union on 2 or more streaming sources from Kafka. Each of these are streaming sources from Kafka that I've already transformed and stored in Dataframes.
I'd ideally want to store the results of this UNIONed dataframe in parquet format in HDFS or potentially even back into Kafka. The ultimate goal is to store these merged events with as low a latency as possible.
val finalDF = flatDF1
.union(flatDF2)
.union(flatDF3)
val query = finalDF.writeStream
.format("parquet")
.outputMode("append")
.option("path", hdfsLocation)
.option("checkpointLocation", checkpointLocation)
.option("failOnDataLoss", false)
.start()
query.awaitTermination()
when doing a writeStream to console instead of parquet I'm getting the expected results, but the example above causes an assertion failure.
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:42)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:124)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
here is the class and assertion that is failing:
case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: Option[OffsetSeqMetadata] = None) {
assert(sources.size == offsets.size)
Is this because the checkpoint is only storing the offsets for one of the dataframes? Looking through the Spark Structured Streaming documentation it looked like it was possible to do joins/union of streaming sources in Spark 2.2 or >
First, please define how your case class OffsetSeq is related to the code with the unions of the dataframes.
Next, checkpointing is a real issue when performing this union and then writing to Kafka with writestream. Separating into multiple writestreams - each with it's own checkpointing - confuses batch id's because of the union operating. Using the same writestream with union of dataframes fails with checkpointing since the checkpoint appears to seek all the models that generated the dataframes before the union and cannot distinguish what row/record came from what dataframe/model.
For writing to Kafka, from structured sql streaming unioned dataframes - best to use writestream with foreach and ForEachWriter including the Kafka Producer in the process method. No checkpointing is needed; the application the just uses temp checkpoint files which are set to be deleted when appropriate - set "forceDeleteTempCheckpointLocation" to true - in the session builder.
Anyway, I have just set up scala code to union an arbitrary number of streaming dataframes and then write to Kafka Producer. Appears to work well once all Kafka Producer code is placed in the ForEachWriter process method so that it can be serialized by Spark.
val output = dataFrameModelArray.reduce(_ union _)
val stream: StreamingQuery = output
.writeStream.foreach(new ForeachWriter[Row] {
def open(partitionId: Long, version: Long): Boolean = {
true
}
def process(row: Row): Unit = {
val producer: KafkaProducer[String, String] = new KafkaProducer[String, String](props)
val record = new ProducerRecord[String, String](producerTopic, row.getString(0), row.getString(1))
producer.send(record)
}
def close(errorOrNull: Throwable): Unit = {
}
}
).start()
Can add more logic in process method if needed.
Note prior to union, all dataframes to be unioned have been converted into key, value string columns. Value is a json string of the message data to be sent over the Kafka Producer. This is also very important to get write before the union is attempted.
svcModel.transform(query)
.select($"key", $"uuid", $"currentTime", $"label", $"rawPrediction", $"prediction")
.selectExpr("key", "to_json(struct(*)) AS value")
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
where svcModel is a dataframe in the dataFrameModelArray.

Input data received all in lowercase on spark streaming in databricks using DataFrame

My spark streaming application consumes data from an aws kenisis and is deployed in databricks. I am using the org.apache.spark.sql.Row.mkString method to consume the data and the whole data is received in lowercase. The actual input had camel case field name and values but is received in lowercase on consuming.
I have tried consuming from a simple java application and is receiving the data in the correct from from the kinesis queue. The issue is only in the spark streaming application using DataFrames and running in databricks.
// scala code
val query = dataFrame
.selectExpr("lcase(CAST(data as STRING)) as krecord")
.writeStream
.foreach(new ForeachWriter[Row] {
def open(partitionId: Long, version: Long): Boolean = {
true
}
def process(row: Row) = {
logger.info("Record received in data frame is -> " + row.mkString)
processDFStreamData(row.mkString, outputHandler, kBase, ruleEvaluator)
}
def close(errorOrNull: Throwable): Unit = {
}
})
.start()
Expectation is the spark streaming input json should be in the same case
letter (camel case)as the data in the kinesis , it should not be converted to lower case once received using data frame.
Any thought's on what might be causing this?
Fixed the issue, the lcase used in the select expression was the culprit, updated the code as below and it worked.
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreach(new ForeachWriter[Row] {
.........

Spark: Write each record in RDD to individual files in HDFS directory

I have a requirement where I want to write each individual records in an RDD to an individual file in HDFS.
I did it for the normal filesystem but obviously, it doesn't work for HDFS.
stream.foreachRDD{ rdd =>
if(!rdd.isEmpty()) {
rdd.foreach{
msg =>
val value = msg._2
println(value)
val fname = java.util.UUID.randomUUID.toString
val path = dir + fname
write(path, value)
}
}
}
where write is a function which writes to the filesystem.
Is there a way to do it within spark so that for each record I can natively write to the HDFS, without using any other tool like Kafka Connect or Flume??
EDIT: More Explanation
For eg:
If my DstreamRDD has the following records,
abcd
efgh
ijkl
mnop
I need different files for each record, so different file for "abcd", different for "efgh" and so on.
I tried creating an RDD within the streamRDD but I learnt it's not allowed as the RDD's are not serializable.
You can forcefully repartition the rdd to no. of partitions as many no. of records and then save
val rddCount = rdd.count()
rdd.repartition(rddCount).saveAsTextFile("your/hdfs/loc")
You can do in couple of ways..
From rdd, you can get the sparkCOntext, once you got the sparkCOntext, you can use parallelize method and pass the String as List of String.
For example:
val sc = rdd.sparkContext
sc.parallelize(Seq("some string")).saveAsTextFile(path)
Also, you can use sqlContext to convert the string to DF then write in the file.
for Example:
import sqlContext.implicits._
Seq(("some string")).toDF