How to Rollback Insert/Update in Spark (Scala) using JDBC

Is there a way to implement commit and rollback in Spark when performing DML on an RDBMS? I'm working on a task that involves inserting and updating records in SQL Server. I'm running a Sqoop command from Spark with JDBC.
I want to make provision for failure by implementing a rollback feature, so that if the job fails midway through the DML operation, I can revert the RDBMS to its previous state. Can this be done in Spark using Scala? I did a lot of searching online and was not able to come up with a solution. Any help, or a pointer to where I can find relevant info, will be appreciated.
Thanks in advance.
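For reference, Spark has no built-in distributed transaction support for JDBC writes, but a common pattern is to manage a transaction per partition with foreachPartition and autoCommit disabled, so each task either commits or rolls back its own batch. Below is a minimal sketch of that idea, assuming a DataFrame df with id and value columns; the URL, credentials, table and column names are placeholders, not anything from the original question.

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.DataFrame

// Placeholder connection details -- adjust for your environment.
val jdbcUrl  = "jdbc:sqlserver://dbhost:1433;databaseName=mydb"
val user     = "spark_user"
val password = "secret"

def writeWithPartitionTransactions(df: DataFrame): Unit =
  df.rdd.foreachPartition { rows =>
    val conn: Connection = DriverManager.getConnection(jdbcUrl, user, password)
    conn.setAutoCommit(false)                       // open an explicit transaction
    try {
      val stmt = conn.prepareStatement("UPDATE my_table SET value = ? WHERE id = ?")
      rows.foreach { row =>
        stmt.setString(1, row.getAs[String]("value"))
        stmt.setLong(2, row.getAs[Long]("id"))
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.commit()                                 // the whole partition succeeds together
    } catch {
      case e: Exception =>
        conn.rollback()                             // undo this partition's changes
        throw e                                     // let Spark mark the task as failed
    } finally {
      conn.close()
    }
  }
```

Note that this gives atomicity per partition (per task), not across the whole job; for true all-or-nothing semantics, a common alternative is to load into a staging table and have the database apply it in a single transaction (for example a MERGE), keeping the rollback entirely on the RDBMS side.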

Related

Is there a way for PySpark to give the user a warning when executing a query on an Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives us no warning that it will proceed to load all partitions and thus will likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on an Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.
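I'm not aware of a built-in configuration for this either, but one workaround is to inspect the physical plan before running the query and warn (or fail) when a partitioned table is scanned without any partition filters. Below is a sketch of that idea in Scala, assuming the table is read through Spark's native file source so the scan appears as a FileSourceScanExec; for Hive SerDe tables the scan node differs and the match would need to be extended. The table name in the usage line is a placeholder.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FileSourceScanExec

// Warn (or fail) when the plan scans a partitioned table with no partition
// filters, i.e. when the query would read every partition.
def checkPartitionPruning(df: DataFrame, failOnFullScan: Boolean = false): DataFrame = {
  val unprunedScans = df.queryExecution.executedPlan.collect {
    case scan: FileSourceScanExec
        if scan.relation.partitionSchema.nonEmpty && scan.partitionFilters.isEmpty => scan
  }
  if (unprunedScans.nonEmpty) {
    val msg = s"Query scans ${unprunedScans.size} partitioned table(s) without partition filters"
    if (failOnFullScan) throw new IllegalStateException(msg)
    else println(s"WARNING: $msg")
  }
  df
}

// Usage (table name is a placeholder):
// val df = checkPartitionPruning(spark.sql("SELECT * FROM my_db.my_partitioned_table"))
```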

Spark SQL - EXPLAIN, DESCRIBE statements not shown in SparkUI

I lately realized that Spark SQL auxiliary statements (EXPLAIN, DESCRIBE, SHOW CREATE, etc.) are not shown in the Spark UI. I have a use case to track all the queries executed through a Spark SQL JDBC connection; only these statements go untracked.
So, my questions are:
Why are these statements not shown in the Spark UI?
Is it possible to get their executions listed in the Spark UI?
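I don't know of a setting that makes these auxiliary commands show up in the SQL tab. As a sketch of one way to track statements yourself rather than relying on the UI, you can register a QueryExecutionListener and log what passes through it; whether EXPLAIN/DESCRIBE actually trigger the listener depends on the Spark version, so treat that part as an assumption to verify.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Logs every statement that goes through Spark's query execution path.
class StatementLogger extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"[$funcName] ${qe.logical.getClass.getSimpleName} took ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"[$funcName] failed: ${exception.getMessage}")
}

val spark = SparkSession.builder().getOrCreate()
spark.listenerManager.register(new StatementLogger)
```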

Upsert to Phoenix table in Apache Spark

I'm looking to find out whether anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. As per the Phoenix documentation, only SaveMode.Overwrite is supported, which is an overwrite with a full load. When I tried changing the mode, it threw an error.
Currently we have *.hql jobs running to perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which
internally builds the UPSERT query for you
In other words, SaveMode.Overwrite with the Phoenix connector is in fact an UPSERT.
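So in practice you keep SaveMode.Overwrite and the connector issues UPSERTs: rows with matching primary keys are updated, new rows are inserted, and nothing is truncated. A minimal sketch of such a write; the table name and ZooKeeper quorum are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Despite the mode name, the Phoenix connector turns this write into UPSERT
// statements rather than a truncate-and-reload.
def upsertToPhoenix(df: DataFrame, table: String, zkUrl: String): Unit =
  df.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)      // the only mode the connector accepts
    .option("table", table)
    .option("zkUrl", zkUrl)
    .save()

// upsertToPhoenix(changesDf, "MY_TABLE", "zkhost:2181")
```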

Use spark-streaming as a scheduler

I have a Spark job that reads from an Oracle table into a DataFrame. The jdbc read method seems to pull an entire table in at once, so I constructed a spark-submit job to work in batch mode. Whenever I have data I need manipulated, I put it in a table and run the spark-submit.
However, I would like this to be more event driven: essentially, any time data is moved into this table it should be run through Spark, so that events in a UI can drive these insertions while Spark just keeps running. I was thinking about using a Spark Streaming context simply to have it watching and operating on the table all the time, but with a long wait between batches. That way I can use the results (also partly written back to Oracle) to trigger a deletion from the read table and avoid processing data more than once.
Is this a bad idea? Will this work? It seems more elegant than using a cron job.
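Spark's JDBC source is not a streaming source, so a streaming context can't watch the Oracle table directly; in practice this usually comes down to a long-running driver that polls the table on an interval, which avoids the external cron job. A minimal sketch of that idea; the URL, credentials, table name and interval are placeholders.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-poller").getOrCreate()

// Placeholder connection details for the Oracle source.
val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
val props = new Properties()
props.setProperty("user", "spark_user")
props.setProperty("password", "secret")

// Long-running driver that polls the staging table on a fixed interval.
while (true) {
  val batch = spark.read.jdbc(url, "STAGING_TABLE", props)
  if (!batch.isEmpty) {
    // ... transform `batch` and write the results back to Oracle here ...
    // then delete the processed rows from STAGING_TABLE (e.g. via plain JDBC)
  }
  Thread.sleep(10 * 60 * 1000L)   // wait 10 minutes between polls
}
```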

Apache Spark and MongoDB integration using pymongo

I have a problem working with Apache Spark and MongoDB using the pymongo library. I am processing thousands of records, and for each record I need to read its corresponding data from the database, update certain info, and save it back to the database. Because of these reads and writes, I chose to use pymongo instead of the Spark-Mongo connector, which apparently isn't well suited for this task. Unfortunately, when performing writes, MongoDB always reports the write as successful, but when I check the database, some updates were not performed. After debugging for over a week, I realized that by restricting the server to a single core, all writes were successful and written to the database, but the application became tremendously slow.
I would like to know if anyone knows how to solve this issue. Thanks in advance.