Is there a way to use Impala rather than Hive in PySpark?

I have queries that work in Impala but not in Hive. I am creating a simple PySpark file such as:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
sconf = SparkConf()
sc = SparkContext.getOrCreate(conf=sconf)
sqlContext = HiveContext(sc)
sqlContext.sql('use db1')
...
When I run this script, its queries hit the same errors I get when I run them in the Hive editor (they work in the Impala editor). Is there a way to fix this so that I can run these queries in the script through Impala?

You can use Impala or HiveServer2 from Spark SQL via the JDBC data source. That requires installing the Impala JDBC driver and configuring the connection to Impala in your Spark application. But "you can" doesn't mean "you should", because it adds overhead and extra dependencies without any particular benefit.
Typically (and that is what your current application is trying to do), Spark SQL runs directly against the underlying file system, without going through either HiveServer2 or the Impala coordinators. In this scenario, Spark only (re)uses the Hive Metastore to retrieve metadata -- database and table definitions.
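For completeness, this is roughly what the JDBC route looks like from PySpark. It is a minimal, hedged sketch rather than a recommendation: the host, port, database, table, and driver class name are placeholders that depend on your Impala JDBC driver version and cluster setup, and the driver jar must already be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("impala-jdbc-example").getOrCreate()

# Read one table from Impala over JDBC; all connection details are assumptions.
impala_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:impala://impala-host:21050/db1")    # hypothetical host/port
    .option("driver", "com.cloudera.impala.jdbc41.Driver")   # depends on driver version
    .option("dbtable", "some_table")                         # hypothetical table
    .load())

impala_df.show()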

Related

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as the big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a local development environment/sandbox on my own computer, similar to that, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage on Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object, it outputs:
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the Spark UI in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how PySpark works. Any help clarifying this would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows environment running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port

How to create dataframe in pyspark using odbc

I need to create a dataframe using PySpark or Spark, without using pandas. I am using an ODBC connection.
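For what it's worth, Spark SQL ships a JDBC data source but no built-in ODBC one, so the usual route is to point the JDBC reader at the same database. A hedged sketch: the URL, driver class, table, and credentials below are placeholders, not values from the question, and the matching JDBC driver jar has to be on the classpath (e.g. passed via --jars or spark.jars).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-dataframe-example").getOrCreate()

# Build a DataFrame straight from the database over JDBC (no pandas involved).
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://db-host:1433;databaseName=mydb")   # placeholder URL
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")   # placeholder driver
    .option("dbtable", "dbo.some_table")                                # placeholder table
    .option("user", "username")
    .option("password", "password")
    .load())

df.show()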

Connect to SQL Data Warehouse from HDInsight OnDemand

I'm trying to read/write data to an Azure SQL Data Warehouse from an on-demand HDInsight Spark cluster.
I can do this from a normal HDInsight Spark cluster by using a script action to install the JDBC driver, but I don't think it's possible to run script actions on the on-demand clusters.
I've tried:
Copying the files from %user%.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8 up to blob storage, in a folder called jars next to where the built Spark code is.
Including the driver dependency in the built jar file.
Both of these led to a java.lang.NoClassDefFoundError.
I'm not too familiar with Scala/Maven/the JVM etc., so I'm not sure what else to try or include in this SO question.
The Scala code I'm trying to run is:
val sqlContext = SparkSession.builder().appName("GenerateEventsSql").getOrCreate()
val jdbcSqlConnStr = "jdbc:sqlserver://someserver.database.windows.net:1433;databaseName=myDW;user=admin;password=XXXX;"
val tableName = "dbo.SomeTable"
val allTableData = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConnStr,
    "dbtable" -> tableName))
  .load()
Jars in a Blob storage folder are not accessible to the classpath of an HDInsight Spark job. You need to copy the jar files to the local host, for example /tmp/jars/xyz.jar, and reference them in the spark-submit command.
For example:
nohup spark-submit --jars /tmp/jars/xyz.jar

Running complex SQL queries on Cassandra tables using Spark SQL

I have set up Cassandra and Spark with the Cassandra-Spark connector. I am able to create RDDs using Scala, but I would like to run complex SQL queries (aggregations, analytical functions, window functions) using Spark SQL on Cassandra tables. Could you help with how I should proceed? I am getting an error like this.
The following is the query used:
sqlContext.sql(
"""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test",
| cluster "Test Cluster",
| pushdown "true"
|)""".stripMargin)
Below is the error: [screenshot omitted; it shows the query being run with sc.sql(...)]
New error: [screenshot omitted]
The first thing I noticed from your post is that sqlContext.sql(...) is used in your query, but your screenshot shows sc.sql(...).
I take the screenshot content as your actual issue. In the Spark shell, once the shell has loaded, both the SparkContext (sc) and the SQLContext (sqlContext) are already created and ready to go. sql(...) doesn't exist on SparkContext, so you should try sqlContext.sql(...).
Most probably, in your spark-shell the context started as a SparkSession, whose value is spark. Try your commands with spark instead of sqlContext.
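If the spark session object is what's available (Spark 2.x and later), the same Cassandra table can be exposed through the Data Source API and then queried with plain SQL. A minimal PySpark sketch for illustration, assuming the spark-cassandra-connector package is on the classpath and that the keyspace test / table words from the question exist; the Cassandra host is an assumption.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("cassandra-sql-example")
    .config("spark.cassandra.connection.host", "127.0.0.1")   # assumed Cassandra host
    .getOrCreate())

# Register the Cassandra table as a temporary view, then query it with SQL.
(spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="words", keyspace="test")
    .load()
    .createOrReplaceTempView("words"))

spark.sql("SELECT * FROM words LIMIT 10").show()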

Spark SQL build for hive?

I have downloaded the Spark 1.3.1 release, package type "Pre-built for Hadoop 2.6 and later".
Now I want to run the Scala code below using the Spark shell, so I followed these steps:
1. bin/spark-shell
2. val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
3. sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Now the problem is, if I verify it in the Hue browser with
select * from src;
then I get a "table not found" exception, which means the table was not created. How do I configure Hive with the Spark shell to make this work? I want to use Spark SQL, and I also need to read and write data from Hive.
I heard that we need to copy the hive-site.xml file somewhere into the Spark directory.
Can someone please explain the steps for configuring Spark SQL with Hive?
Thanks
Tushar
Indeed, the hive-site.xml direction is correct. Take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables.
Also, it sounds like you want to create a Hive table from Spark; for that, look at "Saving to Persistent Tables" in the same document.
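Once hive-site.xml (pointing at the same metastore that Hue uses) is on Spark's classpath, the flow looks roughly like the PySpark sketch below; it mirrors the Scala HiveContext steps from the question, and the app name is just a placeholder.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-table-example")    # placeholder app name
sqlContext = HiveContext(sc)                       # uses the metastore from hive-site.xml

# The table is created in the shared metastore, so it should also be visible from Hue.
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("SHOW TABLES").show()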