Hive table created with Spark not visible from HUE / Hive GUI - scala

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
  .builder()
  .appName("self service")
  .enableHiveSupport()
  .master("local")
  .getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must have been created successfully, because if I run this code twice I get an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any tables in Hive, so it seems the table is being saved in a different location than the one Hive in HUE reads from.
What should I do to make the tables I create from my code visible in the HUE/Hive web GUI?
Any help would be much appreciated.
Thank you very much.

It seems to me you have not added hive-site.xml to the proper path.
hive-site.xml contains the properties Spark needs to connect to Hive successfully, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and pointing it at the directory where the file lives. For example, in a PySpark submit:
/usr/bin/spark2-submit \
--conf spark.driver.extraClassPath=/path/to/dir-with-hive-site.xml \
--master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
--executor-cores n myScript.py
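As a quick sanity check, a minimal stdlib-only sketch like this can verify that hive-site.xml actually sits where Spark will look for it (the /opt/spark fallback is just an assumption for illustration):

```python
import os

def expected_hive_site_path(spark_home):
    """Path where Spark looks for the file: $SPARK_HOME/conf/hive-site.xml."""
    return os.path.join(spark_home, "conf", "hive-site.xml")

# Assumption: /opt/spark is only an illustrative fallback if SPARK_HOME is unset.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
path = expected_hive_site_path(spark_home)
print(path, "exists:", os.path.isfile(path))
```

If the file is missing there, Spark falls back to a local Derby metastore, which is why tables created from your code never show up in the metastore that HUE reads.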

Related

Spark to Synapse "truncate" not working as expected

I have a simple requirement: write a dataframe from Spark (Databricks) to a Synapse dedicated pool table and keep refreshing (truncating) it on a daily basis without dropping it.
The documentation suggests using truncate with overwrite mode, but that doesn't seem to work as expected for me, as I keep seeing the table creation date getting updated.
I am using:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", synapse_jdbc) \
.option("tempDir", tempDir) \
.option("useAzureMSI", "true") \
.option("dbTable", table_name) \
.mode("overwrite") \
.option("truncate","true") \
.save()
But there doesn't seem to be any difference whether I use truncate or not. The creation date/time of the table in Synapse gets updated every time I execute the above from Databricks. Can anyone please help with this? What am I missing?
I already have a workaround that works, but it feels like a hack:
.option("preActions", "truncate table " + table_name) \
.mode("append") \
I tried to reproduce your scenario in my environment, and truncate is not working for me with the Synapse connector either.
While researching this issue, I found that not all options are supported by the Synapse connector. The official Microsoft documentation provides the list of supported options: dbTable, query, user, password, url, encrypt=true, jdbcDriver, tempDir, tempCompression, forwardSparkAzureStorageCredentials, useAzureMSI, enableServicePrincipalAuth, etc.
The truncate option is supported by the jdbc format, not the Synapse connector.
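A tiny sketch to make the point concrete: encode the supported-option list quoted above as a set and check truncate against it (plain Python; the list is copied from this answer, not from my own verification of the docs):

```python
# Supported options of the Synapse (com.databricks.spark.sqldw) connector,
# as quoted from the Microsoft documentation above.
synapse_supported_options = {
    "dbTable", "query", "user", "password", "url", "encrypt",
    "jdbcDriver", "tempDir", "tempCompression",
    "forwardSparkAzureStorageCredentials", "useAzureMSI",
    "enableServicePrincipalAuth",
}

# "truncate" is not in the list, so the connector silently ignores it.
print("truncate" in synapse_supported_options)  # prints False
```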
When I changed the format from com.databricks.spark.sqldw to jdbc, it worked fine.
My Code:
df.write.format("jdbc") \
.option("url", synapse_jdbc) \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", table_name) \
.option("tempDir", tempdir) \
.option("truncate", "true") \
.mode("overwrite") \
.save()
First and second executions: (screenshots showing the table creation time omitted)
Conclusion: the table creation time is the same both times the code is executed, which means overwrite is not dropping the table; it is truncating it.
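To make the conclusion concrete, here is a toy model in plain Python (not Spark or Synapse code): truncating clears the rows but keeps the table object and its creation time, while drop-and-recreate produces a brand-new object:

```python
def overwrite_by_drop(table):
    # Drop + recreate: a brand-new table object, so created_at changes.
    return {"name": table["name"], "rows": [], "created_at": "new-timestamp"}

def overwrite_by_truncate(table):
    # Truncate: only the data is cleared; metadata stays untouched.
    table["rows"] = []
    return table

t = {"name": "my_table", "rows": [1, 2, 3], "created_at": "2023-01-01"}
print(overwrite_by_truncate(t)["created_at"])  # prints 2023-01-01
print(overwrite_by_drop(t)["created_at"])      # prints new-timestamp
```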

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as a big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a similar local development environment/sandbox on my own computer, interacting with Azure Data Lake Storage Gen2.
To install Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage on Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object, it outputs:
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the Spark UI in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how PySpark works. Any help clarifying this would be much appreciated.
You are connecting to a local instance, which in this case is native Windows running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port
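If you are not sure the notebook on the Windows side can even reach the master in WSL, a small stdlib-only sketch can build the URL and probe the port (localhost and 7077 are assumptions; check the master's own web UI, usually on port 8080, for the real spark:// URL):

```python
import socket

def master_url(host="localhost", port=7077):
    """Standalone master URL; 7077 is the default port, adjust if changed."""
    return f"spark://{host}:{port}"

def is_reachable(host, port, timeout=1.0):
    """Best-effort TCP probe: True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

url = master_url()
print(url, "reachable:", is_reachable("localhost", 7077))
```

If the probe fails, the master is either not running (start it with $SPARK_HOME/sbin/start-master.sh inside WSL) or not exposed to the Windows side.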

How to change a mongo-spark connection configuration from a databricks python notebook

I succeeded at connecting to mongodb from spark, using the mongo-spark connector from a databricks notebook in python.
Right now I am configuring the MongoDB URI in an environment variable, but that is not flexible, since I want to change the connection parameters right in my notebook.
I read in the connector documentation that it is possible to override any values set in the SparkConf.
How can I override the values from python?
You don't need to set anything in the SparkConf beforehand*.
You can pass any configuration options to the DataFrame Reader or Writer eg:
df = sqlContext.read \
.option("uri", "mongodb://example.com/db.coll") \
.format("com.mongodb.spark.sql.DefaultSource") \
.load()
* This was added in 0.2

Spark cannot find the postgres jdbc driver

EDIT: See the edit at the end
First of all, I am using Spark 1.5.2 on Amazon EMR, with Amazon RDS for my Postgres database. Second, I am a complete newbie in this world of Spark, Hadoop, and MapReduce.
Essentially my problem is the same as for this guy:
java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL
So the dataframe is loaded, but when I try to evaluate it (by calling df.show(), where df is the dataframe), I get the error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://mypostgres.cvglvlp29krt.eu-west-1.rds.amazonaws.com:5432/mydb
I should note that I start spark like this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar
The solutions suggest delivering the jar onto the worker nodes and somehow setting the classpath on them, which I don't really understand how to do. But then they say the issue was apparently fixed in Spark 1.4, and I'm using 1.5.2 and still having this issue, so what is going on?
EDIT: Looks like I resolved the issue; however, I still don't quite understand why this works and the thing above doesn't, so I guess my question is now: why does doing this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar --conf spark.driver.extraClassPath=/home/hadoop/postgresql-9.4.1207.jre7.jar --jars /home/hadoop/postgresql-9.4.1207.jre7.jar
solve the problem? It seems I just passed the same path to a few more flags.
spark-shell --driver-class-path .... --jars ... works because all jar files listed in --jars are automatically distributed across the cluster.
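For reference, the flags map onto Spark configuration keys, so the same setup can be expressed with --conf pairs (a sketch; the jar path is the one from the question):

```python
# How the spark-shell flags map to Spark configuration keys.
jar = "/home/hadoop/postgresql-9.4.1207.jre7.jar"

flag_to_conf = {
    "--driver-class-path": "spark.driver.extraClassPath",  # driver JVM only
    "--jars": "spark.jars",  # also shipped to the executors
}

for flag, key in flag_to_conf.items():
    print(f"{flag}  ->  --conf {key}={jar}")
```

This is why --driver-class-path alone was not enough: it affects only the driver JVM, while --jars (spark.jars) also ships the driver jar to the executors that actually evaluate df.show().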
Alternatively you could use
spark-shell --packages org.postgresql:postgresql:9.4.1207.jre7
and specify the driver class as an option for DataFrameReader / DataFrameWriter:
val df = sqlContext.read.format("jdbc").options(Map(
"url" -> url, "dbtable" -> table, "driver" -> "org.postgresql.Driver"
)).load()
or even manually copy the required jars to the workers and place them somewhere on the CLASSPATH.

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Given below are the versions of Spark and Hive I have installed on my system:
Spark: spark-1.4.0-bin-hadoop2.6
Hive: apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as the Metastore. The goal is to access the MySQL Metastore and execute HiveQL queries inside spark-shell (using HiveContext).
So far I am able to execute HiveQL queries by accessing the Derby Metastore (as described here; I believe Spark 1.4 comes bundled with Hive 0.13.1, which in turn uses the internal Derby database as the Metastore).
Then I tried to point spark-shell to my external Metastore (MySQL in this case) by setting the property given below (as suggested here) in $SPARK_HOME/conf/spark-defaults.conf:
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf, but I get the following exception when I start the spark-shell:
mountain#mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something, or am I not setting the property spark.sql.hive.metastore.jars correctly?
Note: verified on Linux Mint.
If you are setting properties in spark-defaults.conf, Spark will pick those settings up only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, say wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we will set the Kafka jar packages:
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
A corrupted hive-site.xml will also cause this. Please copy over the correct hive-site.xml.