How does Spark handle failure scenarios involving JDBC data source? - scala

I’m writing a data source that shares similarities with Spark’s JDBC data source implementation, and I’d like to ask how Spark handles certain failure scenarios. To my understanding, if an executor dies while it’s running a task, Spark will revive the executor and try to re-run that task. However, how does this play out in the context of data integrity and Spark’s JDBC data source API (e.g. df.write.format("jdbc").option(...).save())?
In the savePartition function of JdbcUtils.scala, we see Spark calling the commit and rollback functions of the Java connection object generated from the database url/credentials provided by the user (see below). But if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database? And what happens if the executor dies in the middle of calling commit() or rollback()?
try {
  ...
  if (supportsTransactions) {
    conn.commit()
  }
  committed = true
  Iterator.empty
} catch {
  case e: SQLException =>
    ...
    throw e
} finally {
  if (!committed) {
    // The stage must fail. We got here through an exception path, so
    // let the exception through unless rollback() or close() want to
    // tell the user about another problem.
    if (supportsTransactions) {
      conn.rollback()
    }
    conn.close()
  } else {
    ...
  }
}

But if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database?
What would you expect, given that Spark SQL (a high-level API over the RDD API) does not really know much about the peculiarities of JDBC or any other protocol? Not to mention the underlying execution runtime, i.e. Spark Core.
When you write a structured query like df.write.format("jdbc").option(...).save(), Spark SQL translates it into a distributed computation using the low-level, assembly-like RDD API. Since it tries to embrace as many "protocols" as possible (incl. JDBC), Spark SQL's DataSource API leaves much of the error handling to the data source itself.
The Spark core that schedules tasks (and does not know or even care what the tasks do) simply watches execution, and if a task fails it will attempt to execute it again (up to 3 retries by default, i.e. spark.task.maxFailures = 4).
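That retry budget is configurable; a minimal sketch (the app name and the fail-fast value are assumptions, and the setting applies to the whole application, not a single write):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jdbc-writer")
  // spark.task.maxFailures = allowed attempts per task; default 4 (3 retries).
  // 1 means fail the stage on the first task failure instead of retrying.
  .config("spark.task.maxFailures", "1")
  .getOrCreate()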
So when you write a custom data source, you know the drill and have to deal with such retries in your code.
One way to handle errors is to register a task listener using TaskContext (e.g. addTaskCompletionListener or addTaskFailureListener).
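For instance, a minimal sketch of such listeners inside a partition write (the logging is illustrative; df is the DataFrame from the question):

import org.apache.spark.TaskContext

// Runs on the executor for each partition; the listeners fire when the task
// completes or fails, which is where retry-aware cleanup or de-duplication
// bookkeeping can live.
df.rdd.foreachPartition { rows =>
  val ctx = TaskContext.get()
  ctx.addTaskCompletionListener[Unit] { _ =>
    println(s"task ${ctx.taskAttemptId()} (attempt ${ctx.attemptNumber()}) done")
  }
  ctx.addTaskFailureListener { (_, error) =>
    println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }
  // ... write the partition to the database ...
}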

I had to introduce some de-duplication logic for exactly the reasons described. You might indeed end up with the same partition committed twice (or more).
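For illustration, one shape such de-duplication can take, as a minimal sketch: it assumes the target table has a unique key and a database with Postgres-style ON CONFLICT support (table and column names are made up):

import java.sql.DriverManager

// Idempotent partition write: a retried task re-inserts the same keys, and
// ON CONFLICT ... DO NOTHING turns the duplicates into no-ops.
def writePartition(rows: Iterator[(Long, String)], url: String): Unit = {
  val conn = DriverManager.getConnection(url)
  try {
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement(
      "INSERT INTO events (id, payload) VALUES (?, ?) ON CONFLICT (id) DO NOTHING")
    rows.foreach { case (id, payload) =>
      stmt.setLong(1, id)
      stmt.setString(2, payload)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    conn.close()
  }
}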

Related

Flink: posting messages to an external API: custom sink or lambda function

We are developing a pipeline in Apache Flink (DataStream API) that needs to send its messages to an external system using API calls. Sometimes such an API call will fail; in that case our message needs some extra treatment (and/or a retry).
We had a few options for doing this:
We map() our stream through a function that does the API call and returns its result, so we can act on failures afterwards (this was my original idea, and why I did this: flink scala map with dead letter queue).
We write a custom sink function that does the same.
However, I think both options have problems:
With the map() approach I won't be able to get exactly-once (or at-most-once, which would also be fine) semantics, since Flink is free to re-execute pieces of the pipeline after recovering from a crash in order to bring the state up to date.
With the custom sink approach I can't get a stream of failed API calls for further processing: a sink is a dead end from the Flink app's point of view.
Is there a better solution for this problem?
The Async I/O operator is designed for this scenario. It's a better starting point than a map().
There's also been recent work done to develop a generic async sink, see FLIP-171. This has been merged into master and will be released as part of Flink 1.15.
One of those should be your best way forward. Whatever you do, don't do blocking I/O in your user functions. That causes backpressure and often leads to performance problems and checkpoint failures.
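For concreteness, a minimal sketch of the Async I/O route in Scala: failed calls are emitted as Left values so they stay in the stream for retries or a dead-letter queue (the external-call stub, timeout, and capacity are assumptions):

import java.util.concurrent.TimeUnit
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

// Stand-in for a real non-blocking client call.
def callExternalApi(msg: String)(implicit ec: ExecutionContext): Future[Unit] =
  Future { /* call the external system here */ }

class ApiCall extends AsyncFunction[String, Either[String, String]] {
  implicit lazy val ec: ExecutionContext = ExecutionContext.global
  override def asyncInvoke(msg: String, out: ResultFuture[Either[String, String]]): Unit =
    callExternalApi(msg).onComplete {
      case Success(_) => out.complete(Iterable(Right(msg)))
      case Failure(_) => out.complete(Iterable(Left(msg))) // keep failures in-stream
    }
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
val results = AsyncDataStream.unorderedWait(
  env.fromElements("msg1", "msg2"), new ApiCall, 5, TimeUnit.SECONDS, 100)
val failed = results.filter(_.isLeft) // route to retry / dead-letter handling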

Kafka transactions in case of multithreading

I am trying to create a Kafka producer with transactions, i.e. I want to write a group of messages, and if any one fails, I want to roll back all of them.
kafkaProducer.beginTransaction();
try
{
    // code to produce to the kafka topic
    // commit inside the try: calling commitTransaction() after an abort would itself fail
    kafkaProducer.commitTransaction();
}
catch (Exception e)
{
    kafkaProducer.abortTransaction();
}
The problem is that the above works just fine for a single thread, but when multiple threads write, it throws an exception:
Invalid transaction attempted from state IN_TRANSITION to IN_TRANSITION
While debugging I found that if thread1's transaction is in progress and thread2 also calls beginTransaction(), it throws this exception. What I can't find is how to solve this issue. One possible thing I could find is creating a pool of producers.
Is there an already available API for a Kafka producer pool, or will I have to create my own?
Below is the improvement JIRA already reported for this:
https://issues.apache.org/jira/browse/KAFKA-6278
Any other suggestions would be really helpful.
You can only have a single transaction in progress at a time with a producer instance.
If you have multiple threads doing separate processing and they all need exactly once semantics, you should have a producer instance per thread.
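A minimal sketch of that pattern (the broker address, serializers, and the transactional.id scheme are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

// One transactional producer per thread, each with its own transactional.id.
val threadProducer = new ThreadLocal[KafkaProducer[String, String]] {
  override def initialValue(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("transactional.id", s"txn-thread-${Thread.currentThread().getId}")
    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions() // once per producer, before the first transaction
    producer
  }
}
// Each thread then calls beginTransaction()/commitTransaction() on threadProducer.get().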
Not sure if this was resolved.
You can use Apache Commons Pool2 to create a producer instance pool.
In the create() method of the factory implementation you can generate and assign a unique transactional.id to avoid a conflict (ProducerFencedException).
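A minimal sketch of such a factory (the broker address, serializers, and the id prefix are assumptions):

import java.util.Properties
import java.util.concurrent.atomic.AtomicInteger
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.kafka.clients.producer.KafkaProducer

class TransactionalProducerFactory
    extends BasePooledObjectFactory[KafkaProducer[String, String]] {
  private val ids = new AtomicInteger(0)

  override def create(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // A unique transactional.id per pooled instance avoids ProducerFencedException.
    props.put("transactional.id", s"txn-pool-${ids.getAndIncrement()}")
    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    producer
  }

  override def wrap(p: KafkaProducer[String, String]): PooledObject[KafkaProducer[String, String]] =
    new DefaultPooledObject(p)
}

val pool = new GenericObjectPool(new TransactionalProducerFactory)
// Per thread: borrowObject(), run one transaction at a time, then returnObject().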

Flink StreamingEnvironment does not terminate when using Kafka as source

I am using Flink for a Streaming Application. When creating the stream from a Collection or a List, the application terminates and everything after the "env.execute" gets executed normally.
I need to use a different source for the stream. More precisely, I use Kafka as a source (env.addSource(...)). In this case, the program just blocks on reaching the end of the stream.
I created an appropriate Deserialization Schema for my stream, having an extra event that signals the end of the stream.
I know that the isEndOfStream() condition succeeds on that point (I have an appropriate message printed on the screen in this case).
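For reference, the relevant part of such a schema looks roughly like this (a minimal sketch; the sentinel value is an assumption):

import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}

class StringWithSentinelSchema extends DeserializationSchema[String] {
  override def deserialize(message: Array[Byte]): String = new String(message, "UTF-8")
  // The Kafka consumer stops reading once this returns true for an element.
  override def isEndOfStream(nextElement: String): Boolean = nextElement == "END_OF_STREAM"
  override def getProducedType: TypeInformation[String] = Types.STRING
}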
At this point the program just stops and does nothing, so the commands that follow the "execute" line aren't at my disposal.
I am using Flink 1.7.2 and the flink-connector-kafka_2.11, with Scala 2.11.12. I am executing using the IntelliJ environment and Maven.
While researching, I found a suggestion to throw an exception upon reaching the end of the stream (using the schema's capabilities). That does not serve my goal because I also have more operators/commands within the execution of the environment that need to be executed (and they do get executed correctly at this point). If I chose to disrupt the program by throwing an exception, I would lose everything else.
After the execution line I use the .getNetRuntime() function to measure the running time of my operators within the stream.
I need to have the StreamingEnvironment end like it does when using a List as a source. Is there, for example, a way to remove Kafka as a source at that point?

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed; each transformation is of course parallelized, depending on the cluster's resources.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), combined with correct configuration of the in-application scheduler or explicitly limited resources for each action.
Regarding cache: it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
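To illustrate the non-blocking submission mentioned above, a minimal sketch using Scala Futures (it reuses errorsRDD and warningsRDD from the question; the timeout is arbitrary):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each count is submitted from its own thread, so the scheduler may run the
// two jobs concurrently (given free resources; the FAIR scheduler helps).
val errorsF   = Future { errorsRDD.count() }
val warningsF = Future { warningsRDD.count() }
val (errors, warnings) = Await.result(errorsF.zip(warningsF), 10.minutes)
println(s"Errors: $errors, Warnings: $warnings")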
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage, so that count executes much faster than the first.

Shutting down Spark Streaming

I'm new to Spark and I have one question.
I have a Spark Streaming application which uses Kafka. Is there a way to tell my application to shut down if a new batch is empty (say, batchDuration = 15 min)?
Something along these lines should do it:
dstream.foreachRDD { rdd =>
  if (rdd.isEmpty) {
    streamingContext.stop()
  }
}
But be aware that, depending on your application workflow, the first batch (or some batch in between) may also be empty, in which case your job would stop on its first run. You may need to combine several conditions for a more reliable stop.
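One sketch of such a combined condition: only stop after a few consecutive empty batches, and only once some data has been seen (the threshold of 3 is an assumption):

// Driver-side state; foreachRDD's outer closure runs on the driver.
var emptyBatches = 0
var seenData = false

dstream.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    emptyBatches += 1
    if (seenData && emptyBatches >= 3) {
      // A graceful stop from inside a batch can deadlock waiting for this
      // very batch, so stop non-gracefully (or signal a separate thread).
      streamingContext.stop(stopSparkContext = true, stopGracefully = false)
    }
  } else {
    seenData = true
    emptyBatches = 0
  }
}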