I'm new to Spark and I have one question.
I have a Spark Streaming application which uses Kafka. Is there a way to tell my application to shut down if a new batch is empty (let's say batchDuration = 15 min)?
Something along these lines should do it:
dstream.foreachRDD { rdd =>
  if (rdd.isEmpty) {
    streamingContext.stop()
  }
}
But be aware that, depending on your application workflow, the first batch (or some batch in between) could also be empty, so your job might stop on the very first run. You may need to combine a few conditions for a more reliable stop.
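For illustration, a minimal sketch of one such combined condition, stopping only after several consecutive empty batches, assuming the same dstream and streamingContext as above (the counter name and threshold value are made up):

var consecutiveEmptyBatches = 0   // lives on the driver; the foreachRDD body runs there once per batch
val emptyBatchesBeforeStop = 3    // hypothetical threshold

dstream.foreachRDD { rdd =>
  if (rdd.isEmpty) {
    consecutiveEmptyBatches += 1
    if (consecutiveEmptyBatches >= emptyBatchesBeforeStop) {
      streamingContext.stop()
    }
  } else {
    consecutiveEmptyBatches = 0
  }
}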
Related
I want to make something to monitor some Kafka topic continuously and then execute some batch job when a message comes in (hitting some REST API and storing the response). I set something up with KafkaItemReader; however, it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this is not an obvious option, I'm wondering if I am using the right tool for the job.
Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
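For illustration only, a minimal sketch of what such a streaming alternative could look like with Spark Structured Streaming, assuming an existing SparkSession named spark and the spark-sql-kafka connector on the classpath; the broker address, topic name, and processing body are placeholders:

import org.apache.spark.sql.DataFrame

val messages = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "my-topic")                    // placeholder
  .load()

// React to each micro-batch of new messages instead of polling with a timeout.
val handleBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  // call the REST API for the messages in this batch and store the responses
}

messages.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch(handleBatch)
  .start()
  .awaitTermination()   // runs indefinitely, unlike a batch job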
I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, using the following pattern:
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like this isn't possible with AWS Kinesis, and that's what the Databricks documentation suggests as well. My question is: what else can we do to achieve that?
As you mentioned in the question, Trigger.Once isn't supported for Kinesis.
But you can achieve what you need by adding Kinesis Data Firehose into the picture: it will write the data from Kinesis into an S3 bucket (you can select the format you need, such as Parquet or ORC, or just leave it as JSON). You can then point the streaming job at that bucket and use Trigger.Once for it, as it's a normal file-based streaming source (for efficiency it's better to use Auto Loader, which is available on Databricks). Also, to keep costs under control, you can set up a retention policy on the S3 destination to remove or archive files after some period of time, such as a week or a month.
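For illustration, a sketch of what that S3-backed job could look like, reusing my_schema from the question; the bucket paths are hypothetical, and the cloudFiles (Auto Loader) source is Databricks-specific:

import org.apache.spark.sql.streaming.Trigger

val sdf = spark.readStream
  .format("cloudFiles")                      // Databricks Auto Loader; a plain "json" source also works
  .option("cloudFiles.format", "json")       // match whatever format Firehose writes
  .schema(my_schema)
  .load("s3://my-bucket/kinesis-landing/")   // hypothetical Firehose destination

sdf.writeStream
  .trigger(Trigger.Once)
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/kinesis-landing/")  // hypothetical
  .format("delta")
  .start("/out/path")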
A workaround is to stop after X runs, without a trigger. It'll guarantee a bounded number of records per run.
The only issue is that if you have millions of rows waiting in the queue, you won't have the guarantee of processing all of them.
In Scala you can add a streaming query listener (see the sketch after the Python example below); in Python you can count the number of batches.
from time import sleep

s = sdf.writeStream.format("delta").start("/out/path")

# By default spark.sql.streaming.numRecentProgressUpdates=100, so recentProgress keeps up to
# 100 entries in the list. Stop after 10 micro-batches.
# maxRecordsPerFetch is 10 000 by default, so we will consume at most 10 x 10 000 = 100 000 messages per run.
while len(s.recentProgress) < 10:
    print("Batches #: " + str(len(s.recentProgress)))
    sleep(10)

s.stop()
You can have more advanced logic that counts the number of messages processed per batch and stops when the queue is empty (the throughput should drop once the backlog is consumed, as you'll only get the "real-time" flow, not the history).
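For reference, a sketch of the Scala listener approach mentioned above: it stops the query once a micro-batch reports zero input rows (assuming an existing SparkSession named spark; whether a single empty batch really means the queue is drained is up to you):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // numInputRows == 0 suggests the backlog for this query has been drained
    if (event.progress.numInputRows == 0) {
      val query = spark.streams.get(event.progress.id)
      if (query != null) query.stop()
    }
  }
})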
I’m writing a data source that shares similarities with Spark’s JDBC data source implementation, and I’d like to ask how Spark handles certain failure scenarios. To my understanding, if an executor dies while it’s running a task, Spark will revive the executor and try to re-run that task. However, how does this play out in the context of data integrity and Spark’s JDBC data source API (e.g. df.write.format("jdbc").option(...).save())?
In the savePartition function of JdbcUtils.scala, we see Spark calling the commit and rollback functions of the Java connection object generated from the database url/credentials provided by the user (see below). But if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database? And what happens if the executor dies in the middle of calling commit() or rollback()?
try {
  ...
  if (supportsTransactions) {
    conn.commit()
  }
  committed = true
  Iterator.empty
} catch {
  case e: SQLException =>
    ...
    throw e
} finally {
  if (!committed) {
    // The stage must fail. We got here through an exception path, so
    // let the exception through unless rollback() or close() want to
    // tell the user about another problem.
    if (supportsTransactions) {
      conn.rollback()
    }
    conn.close()
  } else {
    ...
  }
}
But if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database?
What would you expect, given that Spark SQL (which is a high-level API over the RDD API) does not really know much about all the peculiarities of JDBC or any other protocol? Not to mention the underlying execution runtime, i.e. Spark Core.
When you write a structured query like df.write.format("jdbc").option(...).save(), Spark SQL translates it into a distributed computation using the low-level, assembly-like RDD API. Since it tries to embrace as many "protocols" as possible (incl. JDBC), Spark SQL's DataSource API leaves much of the error handling to the data source itself.
The Spark core scheduler (which does not know or even care what the tasks do) simply watches the execution, and if a task fails it will attempt to run it again (up to spark.task.maxFailures attempts in total, 4 by default).
So when you write a custom data source, you know the drill and have to deal with such retries in your code.
One way to handle errors is to register a task listener using TaskContext (e.g. addTaskCompletionListener or addTaskFailureListener).
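For illustration, a hedged sketch of registering such listeners inside a partition-level write; df, the logging, and the write body are placeholders, not part of the original answer:

import org.apache.spark.TaskContext
import org.apache.spark.util.{TaskCompletionListener, TaskFailureListener}

df.rdd.foreachPartition { rows =>
  val ctx = TaskContext.get()
  ctx.addTaskFailureListener(new TaskFailureListener {
    override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
      // e.g. record the failed attempt so retries and possible duplicates can be traced later
      println(s"partition ${context.partitionId()}, attempt ${context.attemptNumber()} failed: ${error.getMessage}")
    }
  })
  ctx.addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(context: TaskContext): Unit = {
      println(s"partition ${context.partitionId()}, attempt ${context.attemptNumber()} completed")
    }
  })
  // ... open the connection and write the rows of this partition here ...
}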
I had to introduce some de-duplication logic exactly for the reasons described. You might indeed end up with the same partition committed twice (or more).
The docs state that it is possible to schedule multiple jobs from within one Spark session/context. Can anyone give an example of how to do that? Can I launch several jobs/actions within Futures? What execution context should I use? I'm not entirely sure how Spark manages that. How are the driver or the cluster aware of the many jobs being submitted from within the same driver? Is there anything that signals Spark about it? If someone has an example, that would be great.
Motivation: My data is key-value based, and comes with the requirement that each group of records associated with a key be processed in batch. In particular, I need to use mapPartitions, because in each partition I need to instantiate a non-serializable object for processing my records.
(1) The fact is, I could indeed group things using a Scala collection directly within the partitions and process each group as a batch.
(2) The other approach I am exploring would be to filter the data by key beforehand and launch an action/job for each filtered result (filtered collection). That way there is no need to group within each partition, and I can just process the whole partition as a batch directly. I am assuming that the fair scheduler would do a good job of scheduling things evenly between the jobs. If the fair scheduler works well, I think this solution is more efficient. However, I need to test it, hence I wonder if someone could provide help on how to achieve threading within a Spark session (see the sketch after the note below), and warn if there is any downside to it.
Moreover, if anyone has had to make that choice/evaluation between the two approaches, what was the outcome?
Note: This is a streaming application. Each group of records associated with a key needs a specific configuration of an instantiated object in order to be processed (imperatively, as a batch). Since that object is non-serializable, it needs to be instantiated per partition.
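Not an authoritative answer, but for illustration, a minimal sketch of what approach (2) could look like: one Spark job per key, each submitted from its own Future within the same SparkSession. The name df, the key column, and the pool names are hypothetical, and FAIR pools also assume spark.scheduler.mode=FAIR is set:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.functions.col

val keys = Seq("a", "b", "c")   // hypothetical key values

val jobs = keys.map { key =>
  Future {
    // each Future submits its own Spark job; optionally pin it to a FAIR scheduler pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", s"pool-$key")
    df.filter(col("key") === key).rdd.foreachPartition { rows =>
      // instantiate the non-serializable object once per partition, then process the rows
    }
  }
}

// the driver blocks here until all concurrently submitted jobs have finished
jobs.foreach(Await.result(_, Duration.Inf))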
I have an application that creates a few dataframes, writes them to disk, then runs a command using vertica_python to load the data into Vertica. The Spark Vertica connector doesn't work because of an encrypted drive.
What I'd like to do, is have the application run the command to load the data, then move on to the next job immediately. What it's doing however, is waiting for the load to be done in Vertica before moving to the next job. How can I have it do what I want? Thanks.
What's weird about this problem is that the command I'd like to have run in the background is as simple as db_client.cursor.execute(command). This shouldn't be blocking under normal circumstances, so why is it blocking in Spark?
To be very specific, what is happening is that I'm reading in a dataframe, doing transformations, and writing to S3, and then I'd like to start the DB loading the files from S3 before taking the transformed dataframe, transforming it again, writing to S3, loading to the DB... multiple times.
I see now what I was doing wrong. Simply putting the DB-API call in its own thread isn't enough; I have to put the other calls that I want to run concurrently in their own threads as well.