Spark SQL build for hive? - scala

I have downloaded the Spark 1.3.1 release, package type "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using spark-shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is that if I verify it in the Hue browser with
select * from src;
then I get a
table not found exception
which means the table was not created. How do I configure Hive with spark-shell to make this work? I want to use Spark SQL, and I also need to read and write data from Hive.
I have heard that we need to copy the hive-site.xml file somewhere into the Spark directory.
Can someone please explain the steps for configuring Spark SQL with Hive?
Thanks
Tushar

Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables .
Also, it sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.
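Concretely: copy your cluster's hive-site.xml into Spark's conf/ directory ($SPARK_HOME/conf/hive-site.xml) so that spark-shell talks to the same Hive metastore that Hue queries; without it, Spark spins up its own local Derby metastore in the working directory, which is why Hue cannot see the table. A minimal spark-shell sketch under that assumption (the LOAD DATA line uses the sample file shipped with Spark, and the table name src_small is just for illustration):
// After hive-site.xml is in $SPARK_HOME/conf/, HiveContext connects
// to the shared metastore instead of a local Derby one.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Created in the shared metastore, so it is visible from Hue / Hive CLI.
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// "Saving to Persistent Tables": persist a DataFrame as a Hive table.
val df = sqlContext.sql("SELECT key, value FROM src WHERE key < 10")
df.saveAsTable("src_small")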

Related

Azure databricks - Do we have postgres connector for spark

Also, how do I upsert/update records in Postgres using Spark on Databricks?
I am using Spark 3.1.1.
When trying to write using mode=overwrite, it truncates the table but the records are not getting inserted.
I am new to this. Please help.
You don't need a separate connector for PostgreSQL - it works via the standard JDBC connector, and the PostgreSQL JDBC driver should be included in the Databricks runtime (check the release notes for your specific runtime). So you just need to form a correct JDBC URL as described in the documentation (the Spark documentation also has example URLs for PostgreSQL).
Something like this:
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
Regarding the UPSERT, it's not so simple, and not only for PostgreSQL but for other databases as well:
either you do a full join, take the values from your dataset where they exist, keep the rest from the database, and then overwrite the table - but this is very expensive because you're reading the full table and writing it back
or you do a left join with the database (you need to read it again), go down to the RDD level with .foreachPartition/.foreach, and form a series of INSERT/UPDATE statements depending on whether the data already exists in the database - it's doable, but you need more experience.
Specifically for PostgreSQL you can turn this (foreach) into INSERT ... ON CONFLICT so you won't need to read the full table (see their wiki for more information about this operation); see the sketch after this list.
Another approach - write your data into a temporary table, and then via JDBC issue a MERGE command to incorporate your changes into the target table. This is the more "lightweight" method from my point of view.
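A rough sketch of the foreachPartition / INSERT ... ON CONFLICT approach, written here in Scala (the same pattern works from PySpark via df.foreachPartition); the target table schema.tablename, its columns (id, value) with primary key id, and the connection details are placeholders for illustration:
import java.sql.DriverManager

// One JDBC connection per partition; batch the upserts via ON CONFLICT.
df.foreachPartition { rows: Iterator[org.apache.spark.sql.Row] =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql:dbserver", "username", "password")
  val stmt = conn.prepareStatement(
    """INSERT INTO schema.tablename (id, value) VALUES (?, ?)
      |ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value""".stripMargin)
  try {
    rows.foreach { row =>
      stmt.setLong(1, row.getAs[Long]("id"))
      stmt.setString(2, row.getAs[String]("value"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    stmt.close()
    conn.close()
  }
}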

Is there a way to use Impala rather than Hive in PySpark?

I have queries that work in Impala but not Hive. I am creating a simple PySpark file such as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
sconf = SparkConf()
sc = SparkContext.getOrCreate(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.sql('use db1')
...
When I run this script, its queries get the same errors I get when I run them in the Hive editor (they work in the Impala editor). Is there a way to fix this so that I can run these queries in the script using Impala?
You can use Impala or HiveServer2 in Spark SQL via the JDBC data source. That requires you to install the Impala JDBC driver and configure the connection to Impala in the Spark application. But "you can" doesn't mean "you should", because it incurs overhead and creates extra dependencies without any particular benefit.
Typically (and that is what your current application is trying to do), Spark SQL runs against the underlying file system directly, without needing to go through either HiveServer2 or the Impala coordinators. In this scenario, Spark only (re)uses the Hive metastore to retrieve the metadata - the database and table definitions.
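If you do decide to go the JDBC route anyway, a sketch might look like the following (shown in Scala; the same options apply to spark.read in PySpark, and with older APIs you would use sqlContext.read instead of spark.read). The driver class and port are the usual defaults for the Cloudera Impala JDBC driver and are assumptions to adapt to your environment; the query placed in dbtable is executed by Impala itself, which is what lets Impala-only SQL work:
// Assumes the Impala JDBC driver jar has been made available to Spark
// (e.g. via --jars). Host, port, database and query are placeholders.
val impalaDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:impala://impala-host:21050/db1")
  .option("driver", "com.cloudera.impala.jdbc41.Driver")
  .option("dbtable", "(SELECT * FROM some_table) AS t")
  .load()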

Using Postgresql JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DF from a PostgreSQL DB source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded onto the master and slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on master and slaves and copied the PostgreSQL jar to it (postgresql-9.4.1207.jre6.jar)
Edited spark-defaults.conf to include the wildcard location
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create dataframe in Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy the postgres jar to the slaves, as the driver program and cluster manager take care of everything. I've created a dataframe from a PostgreSQL external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
             .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : <table name, or "(select ...) as alias" for a query>}
df = spark.read.format('jdbc').options(**attribute).load()
Submit the spark job:
Add the downloaded jar to the driver class path while submitting the spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The class name should be org.postgresql.Driver (not com.postgresql.jdbc.Driver). Try using that.

What is the way to connect to hive using scala code and execute query into hive?

I checked out this link but did not find anything useful:
HiveClient Documentation
From raw Scala you can use the Hive JDBC connector: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC.
One more option is to use the Spark HiveContext.
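A minimal sketch of the JDBC route from plain Scala, assuming HiveServer2 is listening on its default port 10000 and the hive-jdbc driver (plus its dependencies) is on the classpath; the host, credentials and the src table used in the query are placeholders:
import java.sql.DriverManager

object HiveJdbcExample {
  def main(args: Array[String]): Unit = {
    // Register the HiveServer2 JDBC driver and open a connection.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://your-hiveserver2-host:10000/default", "user", "")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery("SELECT key, value FROM src LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getInt(1)}\t${rs.getString(2)}")
      }
    } finally {
      conn.close()
    }
  }
}
With the HiveContext option you would instead run the query through Spark SQL inside a Spark application, as in the first question above.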

Running complex SQL queries on Cassandra tables using Spark SQL

I have set up Cassandra and Spark with the Spark Cassandra connector. I am able to create RDDs using Scala, but I would like to run complex SQL queries (aggregations / analytical functions / window functions) using Spark SQL on Cassandra tables. Could you help with how I should proceed? I am getting an error like the one below.
Following is the query used:
sqlContext.sql(
"""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test",
| cluster "Test Cluster",
| pushdown "true"
|)""".stripMargin)
Below is the error (screenshot).
New error (screenshot).
The first thing I noticed from your post is that sqlContext.sql(...) is used in your query, but your screenshot shows sc.sql(...).
I take the screenshot content as your actual issue. In the Spark shell, once the shell has loaded, both the SparkContext (sc) and the SQLContext (sqlContext) are already created and ready to go. sql(...) doesn't exist on SparkContext, so you should try sqlContext.sql(...).
Most probably in your spark-shell the context started as a SparkSession, whose value is spark. Try your commands with spark instead of sqlContext.
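For completeness, a sketch of the same registration plus a more complex query through the SparkSession (spark). This assumes Spark 2.x+ syntax (CREATE TEMPORARY VIEW rather than CREATE TEMPORARY TABLE) and that the words table has columns word and count, which is an assumption based on the connector's demo schema:
// Register the Cassandra table as a temporary view (Spark 2.x+ syntax).
spark.sql(
  """CREATE TEMPORARY VIEW words
    |USING org.apache.spark.sql.cassandra
    |OPTIONS (
    |  table "words",
    |  keyspace "test",
    |  cluster "Test Cluster",
    |  pushdown "true"
    |)""".stripMargin)

// Aggregations and window functions then work as on any other table.
// The column names word and `count` are assumed here.
spark.sql(
  """SELECT word, `count`,
    |       rank() OVER (ORDER BY `count` DESC) AS rnk
    |FROM words""".stripMargin).show()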