I have installed Hadoop 3.3.0 and Hive 3.1.2 in Ubuntu on WSL (Windows Subsystem for Linux).
I have the Hadoop, YARN and HiveServer2 daemons running in Ubuntu WSL.
On my Windows host, I open Scala IDE. From Spark/Scala, I would like to connect to the Hive tables that live in Ubuntu WSL.
On Windows I have nothing related to Hadoop/Hive installed; everything is available only in Ubuntu WSL.
Can someone please explain how to do this from Scala IDE?
I build the project with Maven.
The code I use:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("My APP")
  .config("spark.sql.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate

spark.sql("show tables").show()
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport
Thanks!
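The exception itself means the spark-hive module is not on the classpath: enableHiveSupport() requires org.apache.spark:spark-hive_2.12 (at the same version as your other Spark artifacts) in the pom. With that in place, a minimal sketch of the session setup could look like the code below; note that the metastore URI is normally passed as hive.metastore.uris rather than spark.sql.uris, and the object name here is only illustrative. The thrift port must also be reachable from the Windows host, which may need extra WSL networking setup.

import org.apache.spark.sql.SparkSession

object HiveFromWsl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("My APP")
      // Assumes the Hive metastore service in WSL is listening on port 9083;
      // Spark talks to the metastore directly, not to HiveServer2.
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show tables").show()
  }
}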
Related
We are doing a POC on Iceberg and evaluating it for the first time.
Spark environment:
Spark standalone cluster setup (1 master and 5 workers)
Spark: spark-3.1.2-bin-hadoop3.2
Scala: 2.12.10
Java: 1.8.0_321
Hadoop: 3.2.0
Iceberg 0.13.1
As suggested in Iceberg's official documentation, to add Iceberg support in the Spark shell we add the Iceberg dependency when launching the shell, as below:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1
After launching the Spark shell with the above command, we are not able to use the shell at all. For every command (even non-Iceberg ones) we get the same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/BinaryCommand
The simple commands below also throw the same exception.
val df : DataFrame = spark.read.json("/spark-3.1.2-bin-hadoop3.2/examples/src/main/resources/people.json")
df.show()
In the Spark source code, the BinaryCommand class belongs to the Spark SQL module, so we tried explicitly adding the Spark SQL dependency when launching the Spark shell as below, but we still get the same exception.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1,org.apache.spark:spark-sql_2.12:3.1.2
When we launch spark-shell normally, i.e. without the Iceberg dependency, it works properly.
Any pointer in the right direction for troubleshooting would be really helpful.
Thanks.
We were using the wrong Iceberg version: we chose the Spark 3.2 Iceberg runtime jar while running Spark 3.1. After switching to the correct dependency version (i.e. the 3.1 runtime), we are able to launch the Spark shell with Iceberg. Also, there is no need to specify the org.apache.spark jars via --packages, since they are already on the classpath.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1
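As a follow-up, if the POC later moves from spark-shell to a standalone job, the same Iceberg runtime (on the application classpath) plus the SQL extensions and a catalog can be configured through the SparkSession builder. A minimal sketch, assuming a local Hadoop catalog; the object name, catalog name, warehouse path and table name are illustrative:

import org.apache.spark.sql.SparkSession

// Assumes iceberg-spark-runtime-3.1_2.12:0.13.1 is on the application classpath.
object IcebergPoc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iceberg-poc")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.local.type", "hadoop")
      .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") // illustrative path
      .getOrCreate()

    // Quick smoke test: create and query an Iceberg table in the illustrative catalog.
    spark.sql("CREATE TABLE IF NOT EXISTS local.db.sample (id BIGINT, data STRING) USING iceberg")
    spark.sql("SELECT * FROM local.db.sample").show()
  }
}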
I have the following config:
Databricks Runtime Version
5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Is this a correct connection string for Spark? I've never created one before.
conn_str = "org.apache.spark:spark-avro_2.11:2.4.3,org.mongodb.spark:mongo-spark-connector_2.11:2.4.2"
spark = (
    SparkSession.builder
    .config("spark.jars.packages", conn_str)
    .config("spark.ui.showConsoleProgress", False)
    .getOrCreate()
)
If you're using the Databricks platform, the SparkSession is already initialized when the cluster starts, so it can be too late to install packages this way. It's better to install these libraries one by one via the Libraries tab of the created cluster: use the Maven coordinates option to install org.apache.spark:spark-avro_2.11:2.4.3 and org.mongodb.spark:mongo-spark-connector_2.11:2.4.2 separately. See the documentation for details.
I am trying to run my Scala job on my local machine (a MacBook Pro, macOS 10.13.3) and I am getting an error at runtime.
My versions:
scala: 2.11.12
spark: 2.3.0
hadoop: 3.0.0
I installed everything through brew.
The exception is:
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
It happens at these lines:
val conf = new SparkConf()
  .setAppName(getName)
  .setMaster("local[2]")
val context = new SparkContext(conf)
The last line is where the exception is thrown.
My theory is that the Hadoop and Spark versions aren't working together, but I can't find online which Hadoop version Spark 2.3.0 needs.
Thank you.
So I figured out my problem.
So first, yes, I don't need Hadoop installed. Thanks for pointing that out.
And second, I had Java 10 installed instead of Java 8. Removing it solved the rest of my problems.
Thank you everyone!
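For reference, this particular exception on Spark 2.3 is commonly reported when running on Java 9/10: some version-string parsing still expects the legacy "1.8.0_x" format, and the shorter "10" string trips it. A minimal, illustrative sanity check to confirm which JVM the IDE actually launches before building the SparkContext:

object JvmCheck {
  def main(args: Array[String]): Unit = {
    // Spark 2.3 requires Java 8, so this should print something like 1.8.0_181.
    println(s"java.version = ${System.getProperty("java.version")}")
  }
}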
I've set up Spark 2.2.0 on my Windows machine using Scala 2.11.8 in the IntelliJ IDE. I'm trying to make Spark connect to Netezza using JDBC drivers.
I've read through this link and added the com.ibm.spark.netezza jars to my project through Maven. I attempt to run the Scala script below just to test the connection:
package jdbc

object SimpleScalaSpark {
  def main(args: Array[String]) {
    import org.apache.spark.sql.{SparkSession, SQLContext}
    import com.ibm.spark.netezza

    val spark = SparkSession.builder
      .master("local")
      .appName("SimpleScalaSpark")
      .getOrCreate()

    val sqlContext = SparkSession.builder()
      .appName("SimpleScalaSpark")
      .master("local")
      .getOrCreate()

    val nzoptions = Map(
      "url" -> "jdbc:netezza://SERVER:5480/DATABASE",
      "user" -> "USER",
      "password" -> "PASSWORD",
      "dbtable" -> "ADMIN.TABLENAME")

    val df = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
  }
}
However I get the following error:
17/07/27 16:28:17 ERROR NetezzaJdbcUtils$: Couldn't find class org.netezza.Driver
java.lang.ClassNotFoundException: org.netezza.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:49)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:netezza://SERVER:5480/DATABASE
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:54)
at com.ibm.spark.netezza.NetezzaJdbcUtils$$anonfun$getConnector$1.apply(NetezzaJdbcUtils.scala:46)
at com.ibm.spark.netezza.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at jdbc.SimpleScalaSpark$.main(SimpleScalaSpark.scala:20)
at jdbc.SimpleScalaSpark.main(SimpleScalaSpark.scala)
I have two ideas:
1) I don't believe I actually installed any Netezza JDBC driver; I thought the jars I brought into my project from the link above were sufficient. Am I just missing a driver, or am I missing something in my Scala script?
2) In the same link, the author mentions starting the spark-shell with the Spark Netezza package:
For example, to use the Spark Netezza package with Spark's interactive shell, start it as shown below:
$SPARK_HOME/bin/spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path ~/nzjdbc.jar
I don't believe I'm invoking any package apart from jdbc in my script. Do I have to add that to my script?
Thanks!
Your first idea is right, I think. You almost certainly need to install the Netezza JDBC driver if you have not done so already.
From the link you posted:
This package can be deployed as part of an application program or from
Spark tools such as spark-shell, spark-sql. To use the package in the
application, you have to specify it in your application’s build
dependency. When using from Spark tools, add the package using
--packages command line option. Netezza JDBC driver also should be
added to the application dependencies.
The Netezza driver is something you have to download yourself, and you need a support entitlement to get access to it (via IBM's Fix Central or Passport Advantage). It is included in either the Windows driver/client support package or the Linux driver package.
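If it helps, here is a minimal, self-contained check (illustrative; the class name is mine, and the host, port, database and credentials are the placeholders from the question) to confirm the driver jar is actually on the project classpath before involving Spark:

import java.sql.DriverManager

object NetezzaDriverCheck {
  def main(args: Array[String]): Unit = {
    // Throws ClassNotFoundException if nzjdbc.jar is not on the classpath.
    Class.forName("org.netezza.Driver")
    // Placeholders: replace SERVER, DATABASE, USER and PASSWORD with real values.
    val conn = DriverManager.getConnection("jdbc:netezza://SERVER:5480/DATABASE", "USER", "PASSWORD")
    println(conn.getMetaData.getDatabaseProductName)
    conn.close()
  }
}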
I have installed the Cloudera VM (single node), and inside this VM I have Spark running on top of YARN. I would like to use the Eclipse IDE (with the Scala plugin) for testing/learning with Spark.
If I instantiate the SparkContext as follows, everything works as I expect:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster("local[2]")
However, if I now want to connect to the local YARN cluster by changing the master to 'yarn-client', it does not work:
val master = "yarn-client"
val sparkConf = new SparkConf().setAppName("TwitterPopularTags").setMaster(master)
Specifically, I'm getting the following errors:
Error details displayed in the Eclipse console:
Error details from the NodeManager logs:
Here are the things I have tried so far:
1. Dependencies
I added all the dependencies through the Maven repository.
The Cloudera version is 5.5, the corresponding Hadoop version is 2.6.0, and the Spark version is 1.5.0.
2. Configurations
I added 3 path variables to the Eclipse classpath:
SPARK_CONF_DIR=/etc/spark/conf/
HADOOP_CONF_DIR=/usr/lib/hadoop/
YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
Can anybody clarify what the problem is here and how to solve it?
I worked around it! I still don't understand what the exact problem is, but I created a folder for my username in HDFS, i.e. the /user/myusername directory, and it worked. Anyway, I have now switched to the Hortonworks distribution and found it much smoother to get started with than the Cloudera distribution.
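For anyone hitting the same issue, the usual fix is exactly this: create the HDFS home directory for your user, for example with hdfs dfs -mkdir -p /user/<username> run as the HDFS superuser, and chown it to your user. As a small illustrative check (the class name is mine), the following reads the same Hadoop configuration the Spark job picks up and confirms the home directory exists:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object HdfsHomeCheck {
  def main(args: Array[String]): Unit = {
    // Reads core-site.xml etc. from HADOOP_CONF_DIR / the classpath.
    val fs = FileSystem.get(new Configuration())
    val home = fs.getHomeDirectory // e.g. hdfs://quickstart:8020/user/<username>
    println(s"$home exists: ${fs.exists(home)}")
  }
}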