HiveException when running a SQL example in Spark shell - Scala

I'm new to Apache Spark. I am using Spark 2.4.0 with Scala 2.11.12, and I'm trying to run the following code in the Spark shell:
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
I get the following error:
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw the question "Issues trying out example in Spark-shell" and, following its accepted answer, tried starting the Spark shell like this:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, that did not help and the issue persists.
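One detail in the trace may be worth checking: frames such as java.base/jdk.internal.reflect only appear on Java 9 or later, whereas Spark 2.4.x is normally run on Java 8 (Java 11 support arrived with Spark 3.0). Whether that is the cause here is an assumption; the question does not confirm it. A minimal Python sketch for reading the major version out of `java -version` output follows. The helper name java_major_version and the sample strings are hypothetical.

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the Java major version from `java -version` output.

    Handles the legacy scheme ("1.8.0_191" -> 8) and the
    scheme used since Java 9 ("11.0.1" -> 11).
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy versions are reported as 1.<major>.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


# Illustrative sample outputs (not taken from the question).
print(java_major_version('java version "1.8.0_191"'))
print(java_major_version('openjdk version "11.0.1" 2018-10-16'))
```

To check a real installation, the text can come from subprocess.run(["java", "-version"], capture_output=True, text=True).stderr, since `java -version` writes to stderr.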

Related

SparkContext not initializing in zeppelin

When I try to import SparkContext from pyspark in Apache Zeppelin, the following error is shown:
You should use the spark.pyspark interpreter instead of the python one if you need a spark context.
Try:
%spark.pyspark
print(sc)

Spark 2.2.0 unable to connect to Phoenix 4.11.0 version in loading the table to DF

I'm using the tech stack below and trying to read Phoenix tables from PySpark. I downloaded the following jars from the URL below and ran the code shown. The logs show that the connection to HBase is established, but the console then hangs without doing anything. Please let me know if anybody has encountered and fixed a similar issue.
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/4.11.0-HBase-1.2
jars:
phoenix-spark-4.11.0-HBase-1.2.jar
phoenix-client.jar
Tech stack (all running on the same host):
Apache Spark 2.2.0
HBase 1.2
Phoenix 4.11.0
I copied hbase-site.xml to /spark/conf/hbase-site.xml.
Command executed:
usr/local/spark> spark-submit phoenix.py --jars /usr/local/spark/jars/phoenix-spark-4.11.0-HBase-1.2.jar --jars /usr/local/spark/jars/phoenix-client.jar
phoenix.py:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.phoenix.spark").option("table",
"schema.table1").option("zkUrl", "localhost:2181").load()
df.show()
Error log: the HBase connection is established, but the console then hangs and a timeout error is eventually thrown:
18/07/30 12:28:15 WARN HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
18/07/30 12:28:54 INFO RpcRetryingCaller: Call exception, tries=10, retries=35, started=38367 ms ago, cancelled=false, msg=row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=master01,16020,1532591192223, seqNum=0
Take a look at these answers:
phoenix jdbc doesn't work, no exceptions and stuck
HBase Java client - unknown host: localhost.localdomain
Both of those issues occurred in Java (with JDBC), but this looks like a similar problem.
If you are running your whole stack locally, try adding the ZooKeeper hostname (master01, as seen in the error message) to your /etc/hosts:
127.0.0.1 master01
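Before editing /etc/hosts, it can help to confirm whether the name is already mapped. The following is a small Python sketch that parses hosts-file text and looks up a hostname. The function name lookup_host and the sample text are illustrative; only master01 comes from the log above.

```python
from typing import Optional


def lookup_host(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the first address mapped to `hostname` in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        # First field is the address, the rest are names for it.
        address, *names = entry.split()
        if hostname in names:
            return address
    return None


# Illustrative hosts-file content.
sample = """
127.0.0.1   localhost
# 127.0.0.1 master01   (commented out, so ignored)
127.0.0.1   master01
"""
print(lookup_host(sample, "master01"))
print(lookup_host(sample, "worker01"))
```

On a real machine the text could be read with open("/etc/hosts").read(); a result of None for master01 would suggest the entry is missing.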

How to use s3 with Apache spark 2.2 in the Spark shell

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.
I have consulted the following resources:
Parsing files from Amazon S3 with Apache Spark
How to access s3a:// files from Apache Spark?
Hortonworks Spark 1.6 and S3
Cloudera
Custom s3 endpoints
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
And here is the error that results:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
And here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
After that, loading data from the S3 bucket in the shell should work.
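The version pairing matters because hadoop-aws has to match the Hadoop libraries that Spark itself ships. The NoClassDefFoundError and IllegalAccessError in the question are consistent with hadoop-aws 2.8.1 being loaded into a Hadoop 2.7 runtime. As a rough Python sketch, assuming the Spark distribution's jars/ directory contains a hadoop-common-<version>.jar, the matching Maven coordinate can be derived as below. The function names and the sample listing are hypothetical.

```python
import re
from typing import Iterable, Optional


def bundled_hadoop_version(jar_names: Iterable[str]) -> Optional[str]:
    """Find the Hadoop version from a hadoop-common-<version>.jar file name."""
    for name in jar_names:
        match = re.fullmatch(r"hadoop-common-(\d+(?:\.\d+)*)\.jar", name)
        if match:
            return match.group(1)
    return None


def hadoop_aws_coordinate(jar_names: Iterable[str]) -> str:
    """Build the hadoop-aws Maven coordinate matching the bundled Hadoop."""
    version = bundled_hadoop_version(jar_names)
    if version is None:
        raise ValueError("no hadoop-common jar found")
    return f"org.apache.hadoop:hadoop-aws:{version}"


# Illustrative listing; on a real install use os.listdir on Spark's jars/ directory.
listing = ["spark-core_2.11-2.2.0.jar", "hadoop-common-2.7.3.jar"]
print(hadoop_aws_coordinate(listing))
```

The resulting coordinate can be passed to spark-shell with --packages, or the jar of the same version can be supplied with --jars together with the AWS SDK version that hadoop-aws release was built against.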

PySpark: java.sql.SQLException: No suitable driver

I have Spark code that connects to Netezza and reads a table:
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df=hc.load(source="jdbc",url=address dbname";username=;password=",dbtable="")
I run the code with spark-submit as follows:
spark-submit -jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong here? Is the jar unsuitable, or is Spark unable to recognize it? Please let me know the correct way to do this. Also, can anyone provide a link to the jar for connecting to Netezza from Spark?
I am using Spark 1.6.0.

Connecting to AWS Redshift with Zeppelin Spark 2.0 and Pyspark

I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was the summer of 2016.
I haven't needed it since then, and AWS now has Spark 2.0.
The new syntax is myDF = spark.read.jdbc, like this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", dbtable=aquery).load()
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
(, Py4JJavaError(u'An error occurred while calling o119.jdbc.\n', JavaObject id=o121), )
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I don't know how to implement this, so I did more reading across various blogs and Stack Overflow posts and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
I tried to add a Postgres Interpreter, and I'm not sure I did it right (because I wasn't sure whether to put it in the Spark interpreter or Python interpreter), and I chose the Spark interpreter. Now the Postgres interpreter also has all the same settings as the Spark interpreter, which might not matter, but still I get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered but their information looks like code rather than how to set up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!
You need to use Amazon's Redshift-specific driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR, it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin
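Putting the answer together with the question's code, here is a hedged Python sketch of the JDBC reader options. It assumes the Redshift driver jar exposes the class com.amazon.redshift.jdbc41.Driver and accepts jdbc:redshift:// URLs (check the driver documentation for your version). The helper name redshift_jdbc_options, the host, and the credentials are placeholders. It joins URL parameters with `&` throughout, whereas the URL in the question has a second `?` before `user`, and it passes the credentials as reader options instead of embedding them in the URL.

```python
from urllib.parse import urlencode


def redshift_jdbc_options(host, port, database, table_or_query,
                          user, password, url_params=None):
    """Build an options dict for Spark's generic JDBC data source."""
    url = f"jdbc:redshift://{host}:{port}/{database}"
    if url_params:
        # One '?' starts the query string; every further pair is joined with '&'.
        url += "?" + urlencode(url_params)
    return {
        "url": url,
        "dbtable": table_or_query,
        "user": user,
        "password": password,
        # Assumed driver class name for the RedshiftJDBC41 jar.
        "driver": "com.amazon.redshift.jdbc41.Driver",
    }


opts = redshift_jdbc_options(
    host="example.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    database="idw",
    table_or_query="(SELECT 1 AS x) AS t",
    user="user",
    password="password",
    url_params={"ssl": "true", "tcpKeepAlive": "true"},
)
print(opts["url"])

# In a Zeppelin %pyspark paragraph (not executed here):
#   df = spark.read.format("jdbc").options(**opts).load()
```

Two further points about the question's code: spark.read.jdbc(url, table) already returns a DataFrame, so it takes no trailing .load() and its second parameter is named table, not dbtable. Also, spark.driver.extraClassPath expects filesystem paths to jars (such as the EMR path mentioned above), not a class name like org.postgresql.Driver.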