How to collect all records at the Spark executor and process them as a batch - Scala

In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is to accumulate all the JSON data in a list / rule session and send it to the rule engine for processing as a batch on the executor side.
// Scala code example:
val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session, created on the driver side
  batchDF.foreach { row =>                // this closure runs on the executors
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData)          // NullPointerException here, as ruleSession is not available on the executor
  }
  ruleHandler.processRule(ruleSession)    // back in driver scope again
}

val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch(function)
  .start()

query.awaitTermination()
In the above code, the problem I am facing is that the function passed to foreachBatch executes on the driver side, while the code inside batchDF.foreach runs on the worker/executor side, which is why it fails to get the ruleSession.
Is there any way to run the whole function at each executor side?
OR
Is there a better way to accumulate all the data of a batch DataFrame after transformation and send it on to the next step from within the executor/worker?

I think this might work: rather than calling foreach, you could use foreachBatch or foreachPartition (or a map variant like mapPartitions if you want to return information). In that block, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to Drools (or you might send the whole chunk at once). At the end of the foreachPartition / foreachBatch block, close the connection (if applicable).
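A minimal sketch of that idea (assuming the kBase, jsonHandler and ruleHandler objects from the question, along with the surrounding imports, are available or can be re-created on the executors):

// Scala sketch: create the Drools session per partition, on the executor
val processBatch = (batchDF: DataFrame, batchId: Long) => {
  batchDF.foreachPartition { rows: Iterator[Row] =>
    // This closure runs on the executor, so the rule session is created there.
    val ruleSession = kBase.newKieSession()
    rows.foreach { row =>
      val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
      ruleSession.insert(jsonData)
    }
    ruleHandler.processRule(ruleSession) // fire the rules once per partition
    ruleSession.dispose()
  }
}

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch(processBatch)
  .start()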

#codeaperature, this is how I achieved batching, inspired by your answer; posting it as an answer since it exceeds the word limit of a comment.
Using foreach on the DataFrame and passing in a ForeachWriter.
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON to the rule session in the process method.
Executing the rules in the close method, with the rule session loaded with the batch of data.
// Scala code:
val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null

  def open(partitionId: Long, version: Long): Boolean = { // called once per partition for every trigger/batch
    ruleSession = kBase.newKieSession()
    true
  }

  def process(row: Row): Unit = { // called once for each record in the batch
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add each input JSON to the rule session
  }

  def close(errorOrNull: Throwable): Unit = { // called after process has run for all records in the batch
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}

val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(dataConsumer)
  .start()

Related

Input data received all in lowercase in Spark Streaming on Databricks using DataFrame

My Spark Streaming application consumes data from AWS Kinesis and is deployed on Databricks. I am using the org.apache.spark.sql.Row.mkString method to consume the data, and the whole payload is received in lowercase. The actual input had camel-case field names and values, but it is received in lowercase when consumed.
I have tried consuming from a simple Java application and receive the data in the correct form from the Kinesis queue. The issue occurs only in the Spark Streaming application that uses DataFrames and runs on Databricks.
// Scala code
val query = dataFrame
  .selectExpr("lcase(CAST(data as STRING)) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = {
      true
    }
    def process(row: Row): Unit = {
      logger.info("Record received in data frame is -> " + row.mkString)
      processDFStreamData(row.mkString, outputHandler, kBase, ruleEvaluator)
    }
    def close(errorOrNull: Throwable): Unit = {
    }
  })
  .start()
The expectation is that the streaming input JSON should arrive in the same case (camel case) as the data in Kinesis; it should not be converted to lowercase once received through a DataFrame.
Any thoughts on what might be causing this?
Fixed the issue: the lcase used in the select expression was the culprit. I updated the code as below and it worked.
val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    .........

How to use update output mode with a FileFormat sink?

I am trying to use Spark Structured Streaming in update output mode to write to a file. I found this StructuredSessionization example and it works fine as long as the console format is configured. But if I switch the sink to a file format such as json:
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .format("json")
  .option("path", "/work/output/data")
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()
I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source json does not support Update output mode;
  at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:279)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
  at palyground.StructuredStreamingMergeSpans$.main(StructuredStreamingMergeSpans.scala:84)
  at palyground.StructuredStreamingMergeSpans.main(StructuredStreamingMergeSpans.scala)
Can I use update mode with a FileFormat sink to write the result table to a file?
In the source code I found a pattern match that enforces Append mode.
You cannot write to a file in update mode using Spark Structured Streaming; you need to write a ForeachWriter for it. I have written a simple ForeachWriter below, which you can modify according to your requirements.
val writerForText = new ForeachWriter[Row] {
  var fileWriter: FileWriter = _

  override def open(partitionId: Long, version: Long): Boolean = {
    FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
    fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
    true
  }

  override def process(value: Row): Unit = {
    fileWriter.append(value.toSeq.mkString(","))
  }

  override def close(errorOrNull: Throwable): Unit = {
    fileWriter.close()
  }
}

val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreach(writerForText)
  .start()
Append output mode is required for any of the FileFormat sinks, incl. json, which Spark Structured Streaming validates before starting your streaming query.
if (outputMode != OutputMode.Append) {
  throw new AnalysisException(
    s"Data source $className does not support $outputMode output mode")
}
In Spark 2.4, you could use the DataStreamWriter.foreach operator or the brand new DataStreamWriter.foreachBatch operator, which simply accepts a function that takes the Dataset of a batch and the batch ID:
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
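For example, a minimal sketch of that route applied to the query above (normalizing sessionUpdates to a DataFrame for the sketch; inside the batch function the regular batch writer applies, which has no output-mode restriction):

// Scala sketch: update output mode, writing one JSON directory per micro-batch
val writeBatch = (batch: DataFrame, batchId: Long) => {
  batch.write
    .mode("append")
    .json(s"/work/output/data/batch_$batchId") // plain batch write, no output-mode check
}

val query = sessionUpdates.toDF() // normalize to a DataFrame for this sketch
  .writeStream
  .outputMode("update")
  .option("checkpointLocation", "/work/output/checkpoint")
  .foreachBatch(writeBatch)
  .start()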

Spark Structured Streaming Run Final Operation on External Table (MSCK REPAIR)

Is there a way in Spark's structured streaming to add a final operation to a DataStreamWriter's query plan? I'm attempting to read from a streaming data source, enrich the data in some way, and then write back to a partitioned, external table (assume Hive) in parquet format. The write operation works just fine, partitioning the data in directories for me, but I can't seem to figure out how to additionally run an MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION operation after writing the data to disk for any new partitions that may have been created.
For simplicity's sake, take the following Scala code as an example:
SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  <-------------------- something I can place here for an additional operation?
  .awaitTermination()
Potential workarounds?:
1: Maybe using something like .foreach(new ForeachWriter[Row]) and passing in a FileStreamSink or something similar would work (using close() to run an external query), but I haven't looked into it enough to get a good grasp on using it. Update: using a ForeachWriter does not result in the close() method being called after a batch completes.
2: Forking the stream. Something along the lines of the following:
val stream = SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations

stream
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  .awaitTermination()

stream
  .map(getPartitionName).distinct
  .map { partition =>
    // Run query here
    partition
  }
  .writeStream
  .start()
  .awaitTermination()
The problem here would be ensuring the first operation completes before the second.
3: Naming the query and attaching a listener for completed batches which manually adds all partitions. A bit of a waste, but potentially viable?
...
stream
  .writeStream
  .queryName("SomeName")
  ...

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.name == "SomeName") {
      // search through files in the filesystem and add partitions
      fileSystem.listDir("/path/to/directory").foreach { partition =>
        // run "ALTER TABLE ADD PARTITION $partition"
      }
    }
  }
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
})
I didn't see anything in the documentation that covers this, hopefully I didn't miss anything. Thanks in advance.
Using a StreamingQueryListener works, though I'm not sure if it's good/bad practice.
I implemented something along the lines of this:
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      client.sql(s"MSCK REPAIR TABLE $db.$table")
    }
  }
})
If you happen to have time-based partitions, this works decently as long as you intend to create partitions based on now():
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()
  var lastPartition: String = ""
  val dateTimeFormat: String = "yyyy-MM-dd"

  override def onQueryStarted...
  override def onQueryTerminated...

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink[s3") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      val newPartition = new DateTime().toString(dateTimeFormat)
      if (newPartition != lastPartition) {
        client.sql(s"ALTER TABLE $db.$table ADD IF NOT EXISTS PARTITION ($partitionColumn='$newPartition')")
        lastPartition = newPartition
      }
    }
  }
})

Structured Streaming - Foreach Sink

I am basically reading from a Kafka source and dumping each message through to my foreach processor (thanks to Jacek's page for the simple example).
If this actually works, I will perform some business logic in the process method here; however, this doesn't work. I believe the println doesn't show anything because it runs on the executors and there is no way to get those logs back to the driver. However, the insert into the temp table should at least work and show me that the messages are actually consumed and processed through to the sink.
What am I missing here?
Really looking for a second set of eyes to check my effort here:
val stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
  .option("subscribe", src_topic)
  .load()

val rec = stream.selectExpr("CAST(value AS STRING) as txnJson").as[String]
val df = stream.selectExpr("cast (value as string) as json")

val writer = new ForeachWriter[Row] {
  val scon = new SConConnection

  override def open(partitionId: Long, version: Long) = {
    true
  }

  override def process(value: Row) = {
    println("++++++++++++++++++++++++++++++++++++" + value.get(0))
    scon.executeUpdate("insert into rs_kafka10(miscCol) values(" + value.get(0) + ")")
  }

  override def close(errorOrNull: Throwable) = {
    scon.closeConnection
  }
}

val yy = df.writeStream
  .queryName("ForEachQuery")
  .foreach(writer)
  .outputMode("append")
  .start()

yy.awaitTermination()
Thanks to the comments from Harald and others, I found out a couple of things that led me to normal processing behaviour:
Test the code in local mode; YARN isn't the biggest help in debugging.
For some reason, the process method of the foreach sink doesn't allow calling other methods; when I put my business logic directly in there, it works (see the sketch below).
Hope it helps others.
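If those other methods belong to driver-side helper objects, the usual explanation is that everything process touches has to be serializable and shipped to the executors. A minimal sketch of one common workaround is to create the helpers inside open() so they are instantiated on the executor (BusinessLogic and its handle method are hypothetical stand-ins for the business logic):

// Scala sketch: create helper objects in open() so they are never serialized from the driver
val writer = new ForeachWriter[Row] {
  @transient private var scon: SConConnection = _
  @transient private var logic: BusinessLogic = _ // hypothetical helper class

  override def open(partitionId: Long, version: Long): Boolean = {
    scon = new SConConnection       // runs on the executor
    logic = new BusinessLogic(scon) // hypothetical: wraps the per-record business logic
    true
  }

  override def process(value: Row): Unit = {
    logic.handle(value.getString(0)) // apply the business logic per record
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (scon != null) scon.closeConnection
  }
}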

Spark is duplicating work

I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new hive.HiveContext(sc)

    val query = "<Some Hive Query>"
    val rawData = sqlContext.sql(query).cache()

    val aggregatedData = rawData.groupBy("group_key")
      .agg(
        max("col1").as("max"),
        min("col2").as("min")
      )

    val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))

    aggregatedData.foreachPartition { rows =>
      writePartitionToRedis(rows, redisConfig)
    }

    aggregatedData.write.parquet(s"/data/output.parquet")
  }
}
Against my intuition, the Spark scheduler yields two jobs, one for each data sink (Redis and HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the expected behaviour?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; it is just a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, only stored instructions that will need to be evaluated every time you call an action.
If you want to reuse data without re-evaluating an RDD, use .cache() or, preferably, persist. Persisting an RDD lets you store the result of a transformation so that it doesn't need to be re-evaluated by future actions.
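A minimal sketch of that suggestion applied to the job above, persisting aggregatedData (the result both sinks share) rather than rawData:

// Scala sketch: persist the shared result so both sinks reuse it
import org.apache.spark.storage.StorageLevel

val aggregatedData = rawData.groupBy("group_key")
  .agg(max("col1").as("max"), min("col2").as("min"))
  .persist(StorageLevel.MEMORY_AND_DISK) // materialized by the first action, reused afterwards

aggregatedData.foreachPartition { rows: Iterator[Row] =>
  writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet("/data/output.parquet")

aggregatedData.unpersist() // release the cached blocks once both writes are done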