Log Snowpark pushdown queries in Scala

We are using Snowpark to connect to Snowflake and perform operations on Snowflake data; the codebase is in Scala.
Is there any way to log the pushdown queries?

Yes, there are several ways to do this:
use the Snowflake UI and look at Activity -> Query History
use the explain method on the DataFrame
enable debug logging in Snowpark and check the logs
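For reference, a minimal Scala sketch of the explain approach; the connection properties and table name below are placeholders:

```scala
import com.snowflake.snowpark.Session
import com.snowflake.snowpark.functions.col

object PushdownLoggingSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection properties; replace with your own account details.
    val session = Session.builder.configs(Map(
      "URL"       -> "https://<account>.snowflakecomputing.com",
      "USER"      -> "<user>",
      "PASSWORD"  -> "<password>",
      "ROLE"      -> "<role>",
      "WAREHOUSE" -> "<warehouse>",
      "DB"        -> "<database>",
      "SCHEMA"    -> "<schema>"
    )).create

    val df = session.table("MY_TABLE").filter(col("AMOUNT") > 100)

    // explain() prints the query plan, including the SQL text that Snowpark
    // will push down to Snowflake for this DataFrame.
    df.explain()
  }
}
```

For the logging route, the Snowpark Scala client logs through SLF4J, so raising the log level for the com.snowflake.snowpark loggers to DEBUG in your log4j/logback configuration should also surface the generated queries; the exact logger names may differ between client versions.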

Related

Is there a way for PySpark to give the user a warning when executing a query on an Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives us/the user no warning that it will proceed to load all partitions and thus likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on an Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.

Spark SQL - EXPLAIN, DESCRIBE statements not shown in SparkUI

Lately I realized that Spark SQL auxiliary statements (EXPLAIN, DESCRIBE, SHOW CREATE, etc.) are not shown in the Spark UI. I have a use case to track all the queries executed through a Spark SQL JDBC connection; just these statements go untracked.
So, my questions are:
Why are these not shown in the Spark UI?
Is it possible to get the execution of these statements listed in the Spark UI?

Upsert to Phoenix table in Apache Spark

Looking to find out whether anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. As per the Phoenix documentation, only SaveMode.Overwrite is supported, which is an overwrite with a full load. I tried changing the mode and it throws an error.
Currently, we have *.hql jobs running to perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which
internally builds the UPSERT query for you
In other words, SaveMode.Overwrite with the Phoenix connector is in fact an UPSERT.
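To make that concrete, here is a minimal Scala sketch assuming the phoenix-spark connector; the table name, columns, and ZooKeeper URL are placeholders. Despite the Overwrite mode, rows are written as Phoenix UPSERTs, so rows with matching primary keys are updated and new rows are inserted:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PhoenixUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-upsert").getOrCreate()
    import spark.implicits._

    // Example data; ID is assumed to be the Phoenix primary key.
    val df = Seq((1L, "alice"), (2L, "bob")).toDF("ID", "NAME")

    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)            // effectively UPSERT for Phoenix
      .option("table", "MY_TABLE")         // placeholder table name
      .option("zkUrl", "zk-host:2181")     // placeholder ZooKeeper quorum
      .save()
  }
}
```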

GCP Dataproc Spark consuming BigQuery

I'm very new to GCP (Google Cloud Platform), so I hope my question doesn't look too silly.
Background:
The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables I'm planning to use Dataproc, deploying a PySpark script; ideally I would be able to use sqlContext to apply a few SQL queries to the DataFrames (the tables pulled from BQ). Finally, I could easily dump this info into a file within a Cloud Storage bucket.
Questions:
Can I use import google.datalab.bigquery as bq within my PySpark script?
Is this proposed approach the most efficient, or should I consider another one instead? Keep in mind that I need to create many temporary queries, which is why I thought of Spark.
I expect to use pandas and bq to read the query results as a pandas df, following this example. Later, I might use sc.parallelize from Spark to transform the pandas df into a Spark df. Is this approach the right one?
my script
Update:
After a back-and-forth with #Tanvee, who kindly attended to this question, we concluded that GCP requires an intermediate staging step when you need to read BigQuery data into Dataproc. Briefly, your Spark or Hadoop job may need a temporary Cloud Storage bucket where the table data is staged before it is loaded into Spark.
References:
BigQuery Connector
Deployment
Thanks so much.
You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It will create an RDD, which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.
You can directly use the following options to connect to a BigQuery table from Spark.
You can also use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.
https://github.com/GoogleCloudPlatform/spark-bigquery-connector is a newer connector which is in beta. It is a Spark data source API for BigQuery and is easy to use.
Please refer to the following link:
Dataproc + BigQuery examples - any available?
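As a reference point, here is a minimal Scala sketch assuming the GoogleCloudPlatform spark-bigquery-connector is on the classpath (for example via --jars when submitting to Dataproc); the project, dataset, table, and bucket names are placeholders. It also illustrates the intermediate Cloud Storage staging step mentioned in the update above:

```scala
import org.apache.spark.sql.SparkSession

object BigQueryOnDataprocSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-example").getOrCreate()

    // Read a BigQuery table as a DataFrame and run SQL on it.
    val df = spark.read
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_table")   // placeholder table
      .load()
    df.createOrReplaceTempView("my_table")
    val result = spark.sql("SELECT col_a, COUNT(*) AS n FROM my_table GROUP BY col_a")

    // Writes are staged through a temporary Cloud Storage bucket before being
    // loaded into BigQuery; this is the intermediate step discussed above.
    result.write
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_result")  // placeholder table
      .option("temporaryGcsBucket", "my-temp-bucket")      // placeholder bucket
      .save()
  }
}
```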

How to access a Hive ACID table in Spark SQL?

How can you access a Hive ACID table in Spark SQL?
We have worked on and open-sourced a data source that enables users to work with their Hive ACID transactional tables using Spark.
Github: https://github.com/qubole/spark-acid
It is available as a Spark package, and instructions to use it are on the GitHub page. Currently the data source supports only reading from Hive ACID tables, and we are working on adding the ability to write into these tables via Spark as well.
Feedback and suggestions are welcome!
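For illustration, a minimal Scala sketch based on the usage documented in that repository; the package must be on the classpath, and the database/table name below is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object HiveAcidReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-acid-read")
      .enableHiveSupport()
      .getOrCreate()

    // Read a Hive ACID table through the spark-acid data source.
    val df = spark.read
      .format("HiveAcid")
      .option("table", "default.acid_tbl")   // placeholder database.table
      .load()

    df.show()
  }
}
```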
#aniket Spark doesn't support reading Hive ACID tables directly (https://issues.apache.org/jira/browse/SPARK-15348, https://issues.apache.org/jira/browse/SPARK-16996).
The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you are reading while something is writing to this table, your read may fail (without the special logic) because it will try to read incomplete ORC files. Compaction may (again, without the special logic) make it look like your data is duplicated.
It can be done (WIP) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991
I faced the same issue (Spark for Hive ACID tables) and was able to manage with a JDBC call from Spark. This JDBC approach can be used until we get native ACID support in Spark.
https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID
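In the same spirit, a minimal Scala sketch of the JDBC workaround, reading a Hive ACID table through HiveServer2; the host, port, table name, and credentials are placeholders, and the Hive JDBC driver must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object HiveAcidJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hive-acid-jdbc").getOrCreate()

    // Read the ACID table via HiveServer2 JDBC; Hive executes the query, so the
    // transactional read logic is handled on the Hive side.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hive-server:10000/default")   // placeholder URL
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "acid_tbl")                             // placeholder table
      .option("user", "<user>")
      .option("password", "<password>")
      .load()

    df.show()
  }
}
```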
Spark can read an ACID table directly at least since Spark 2.3.2, but I can also confirm it can't read an ACID table in Spark 2.2.0.