Spark SQL - EXPLAIN, DESCRIBE statements not shown in Spark UI - scala

I recently noticed that Spark SQL auxiliary statements (EXPLAIN, DESCRIBE, SHOW CREATE, etc.) are not shown in the Spark UI. I have a use case to track all queries executed through a Spark SQL JDBC connection; only these statements go untracked.
So, my questions are,
Why are these statements not shown in the Spark UI?
Is it possible to get the execution of these statements listed in the Spark UI?
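One way to track queries independently of the Spark UI is a `QueryExecutionListener`. A minimal sketch is below; note this is an assumption on my part for the auxiliary statements, since command-style queries (EXPLAIN, DESCRIBE, ...) are executed eagerly as commands and may or may not reach the listener depending on the Spark version, so verify against your deployment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

object QueryLogger {
  // Registers a listener that logs every query execution Spark reports.
  def register(spark: SparkSession): Unit = {
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"[$funcName] took ${durationNs / 1e6} ms: ${qe.logical.toString.take(200)}")
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        println(s"[$funcName] failed: ${exception.getMessage}")
    })
  }
}
```

If the listener does not fire for these statements on your version, server-side logging (e.g. the Thrift server's operation log) may be the only place they appear.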

Related

Log Snowpark pushdown queries in scala

Using Snowpark to connect to Snowflake and perform operations on Snowflake data; the codebase is in Scala.
Is there any way to log pushdown queries?
Yes, there are several ways to do this:
use the Snowflake UI and look at Activity -> Query History
use the explain method on the DataFrame
enable debug logging in Snowpark and check the log
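The second option can be sketched as below; the connection properties and table/column names are placeholders for illustration, assuming the Snowpark Scala API:

```scala
import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.col

// Placeholder connection properties -- fill in your own account details.
val session = Session.builder.configs(Map(
  "URL"       -> "https://<account>.snowflakecomputing.com",
  "USER"      -> "<user>",
  "PASSWORD"  -> "<password>",
  "ROLE"      -> "<role>",
  "WAREHOUSE" -> "<warehouse>",
  "DB"        -> "<db>",
  "SCHEMA"    -> "<schema>"
)).create

val df = session.table("MY_TABLE").filter(col("ID") > 100)

// Prints the SQL that Snowpark pushes down to Snowflake for this DataFrame.
df.explain()
```

This prints the generated SQL without executing it, which is often enough to confirm what is actually being pushed down.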

Is there a way for PySpark to give user warning when executing a query on Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, we get no warning that it will proceed to load all partitions and thus likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on an Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.
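I'm not aware of a built-in Spark option for this, but one workaround is to inspect the physical plan before triggering an action and fail if no partition filters are present. A sketch, with the caveat that the plan's text form (and the `PartitionFilters` marker) varies by Spark version and catalog, so check it against your own EXPLAIN output first:

```scala
import org.apache.spark.sql.DataFrame

// Guard: throw before executing if the physical plan shows an empty
// partition-filter list, which suggests a full scan of all partitions.
// The marker string is an assumption -- verify it on your Spark version.
def assertPartitionPruned(df: DataFrame): Unit = {
  val plan = df.queryExecution.executedPlan.toString
  if (plan.contains("PartitionFilters: []"))
    throw new IllegalStateException(
      "Query would scan all partitions -- add a WHERE clause on the partition keys")
}
```

Callers would run `assertPartitionPruned(df)` before `df.write`/`df.collect`, which turns the silent full scan into a hard error.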

How to Rollback Insert/Update in spark (Scala) using JDBC

Is there a way to implement commit and rollback in Spark when performing DML on an RDBMS? I'm working on a task that involves inserting and updating records in SQL Server. I'm running Sqoop commands from Spark, with JDBC.
I want to make provision for failure by implementing a rollback feature, so that if the job fails midway through the DML operation, I can revert the RDBMS to its previous state. Can this be done in Spark using Scala? I did a lot of searching online and was not able to find a solution. Any help, or a pointer to where I can get relevant info, will be appreciated.
Thanks in advance.
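Spark itself has no distributed commit/rollback for JDBC writes, but you can manage transactions yourself with plain JDBC inside `foreachPartition`. A sketch follows; the table, columns, and connection details are placeholders, and note this gives per-partition atomicity only (a failure rolls back that partition's batch, not the whole job). Whole-job atomicity usually means writing to a staging table and swapping it in with a single transaction at the end:

```scala
import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, Row}

def writeWithRollback(df: DataFrame, url: String, user: String, pass: String): Unit = {
  df.foreachPartition { rows: Iterator[Row] =>
    val conn = DriverManager.getConnection(url, user, pass)
    conn.setAutoCommit(false)            // open a transaction for this partition
    try {
      val stmt = conn.prepareStatement("UPDATE my_table SET col1 = ? WHERE id = ?")
      rows.foreach { r =>
        stmt.setString(1, r.getString(0))
        stmt.setLong(2, r.getLong(1))
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.commit()                      // all-or-nothing for this partition's rows
    } catch {
      case e: Exception =>
        conn.rollback()                  // undo this partition's uncommitted changes
        throw e
    } finally conn.close()
  }
}
```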

Is Hive on Tez with ORC really better than Spark SQL for ETL performance?

I have little experience in Hive and am currently learning Spark with Scala. I am curious whether Hive on Tez is really faster than Spark SQL. I searched many forums for test results, but they compared older versions of Spark and most were written in 2015. The main points are summarized below:
ORC will do the same as parquet in Spark
The Tez engine will give performance comparable to the Spark engine
Joins are better/faster in Hive than Spark
I feel like Hortonworks puts more support behind Hive than Spark, and Cloudera vice versa.
sample links :
link1
link2
link3
Initially I thought Spark would be faster than anything because of its in-memory execution, but after reading some articles I learned that existing Hive is also improving with new concepts like Tez, ORC, LLAP, etc.
We currently run PL/SQL on Oracle and are migrating to big data since volumes are increasing. My requirements are ETL-style batch processing; the data details below are involved in every weekly batch run. Data volume will grow significantly soon.
Input/lookup data are in CSV/text formats and are loaded into tables
Two input tables with 5 million rows and 30 columns
30 lookup tables used to generate each column of the output table, which contains around 10 million rows and 220 columns
Multiple joins, both inner and left outer, since many lookup tables are used
Kindly advise which of the methods below I should choose for better performance and readability, and to make it easy to include minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
Spark SQL with Parquet format, converted from the text/CSV input
Scala for UDFs
Hopefully we can perform multiple inner and left outer joins in Spark
The best way to implement a solution to your problem is as follows.
For loading the data into tables, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using window functions of some kind, and register the results in Hive. Since the data is populated from various lookup tables during ingestion, you can write the code programmatically in Scala.
But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the tables with Hive, you can use Hive for that.
Hive supports three execution engines:
Spark
Tez
Mapreduce
Tez is mature; Spark is still evolving, with various commits from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more established in the industry.
In short, use Spark to process the data for the daily processing and register the results with Hive.
Create business users in hive.
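The suggested pipeline can be sketched as below; the paths, table names, and join columns are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("weekly-etl")
  .enableHiveSupport()                 // connect to the Hive metastore
  .getOrCreate()

// Read the CSV/text inputs and one of the lookup tables.
val input  = spark.read.option("header", "true").csv("/data/input.csv")
val lookup = spark.read.option("header", "true").csv("/data/lookup.csv")

// One of the many lookup joins; repeat for the other lookup tables.
val output = input.join(lookup, Seq("key"), "left_outer")

// Write as Parquet and register in the Hive metastore, so business
// users can query the result directly from Hive.
output.write.mode("overwrite").format("parquet").saveAsTable("etl_db.weekly_output")
```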

Use spark-streaming as a scheduler

I have a Spark job that reads from an Oracle table into a dataframe. The way the jdbc read method seems to work is to pull in an entire table at once, so I built a spark-submit job that works in batch. Whenever I have data that needs to be manipulated, I put it in a table and run the spark-submit job.
However, I would like this to be more event-driven: essentially, any time data is moved into this table I want it run through Spark, so that events in a UI can drive these insertions while Spark just keeps running. I was thinking of using a Spark Streaming context just to have it watching and operating on the table all the time, but with a long interval between batches. That way I can use the results (also written in part to Oracle) to trigger deletion of the rows already read, so data is never processed more than once.
Is this a bad idea? Will this work? It seems more elegant than using a cron-job.
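One caveat: a JDBC table is not a supported streaming source in Spark, so a streaming context with a long batch interval effectively degenerates into polling. A plain loop inside one long-running job gives the same behavior with less machinery. A sketch, assuming placeholder connection details and a placeholder transformation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("table-poller").getOrCreate()

val jdbcUrl = "jdbc:oracle:thin:@//host:1521/service"
val props = new java.util.Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")

// Long-running driver loop: poll the work table, process any new rows,
// write results back, sleep, repeat.
while (true) {
  val batch = spark.read.jdbc(jdbcUrl, "WORK_QUEUE", props)
  if (!batch.isEmpty) {
    val result = batch.groupBy("category").count()   // placeholder transformation
    result.write.mode("append").jdbc(jdbcUrl, "RESULTS", props)
    // Deleting the processed rows would go through a plain JDBC DELETE here,
    // so the next poll doesn't pick them up again.
  }
  Thread.sleep(60 * 1000)   // poll once a minute
}
```

This keeps the "event-ish" behavior you want without a cron job, at the cost of the polling latency you choose.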