Unable to read a Hive table from a Zeppelin notebook - Scala

I am new to Spark. I am running the query below and it fails with this error:
val cop_raw = sqlContext.sql("select * from cop.p_id")
cop_raw.show(5)
java.io.IOException:
shadehive.org.apache.hive.service.cli.HiveSQLException: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query:
org.apache.hadoop.hive.ql.parse.ParseException: line 1:400
Failed to recognize predicate 'date'.
Failed rule: 'identifier' in table or column identifier
Can somebody suggest how to fix it?
I can see that setting the property below should fix the issue, but I am not sure how to run this command in Zeppelin when the Hive interpreter is not set up.
SET hive.support.sql11.reserved.keywords=false

Have you tried:
sqlContext.sql("SET hive.support.sql11.reserved.keywords=false;")
For me, this works in Spark 2:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SET hive.support.sql11.reserved.keywords=false;")

Related

pyspark dataframe error due to java.lang.ClassNotFoundException: org.postgresql.Driver

I want to read data from PostgreSQL using JDBC and store it in a PySpark dataframe. When I try to preview the data with methods like df.show() or df.take(), they fail with Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver, but df.printSchema() returns the table's schema perfectly.
Here is my code:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder.master("spark://spark-master:7077")
    .appName("read-postgres-jdbc")
    .config("spark.driver.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)
sc = spark.sparkContext
df = (
    spark.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://postgres/postgres")
    .option("table", 'public."ASSET_DATA"')
    .option("dbtable", _select_sql)
    .option("user", "airflow")
    .option("password", "airflow")
    .load()
)
df.show(1)
Error log:
Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.21.0.6, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
Edited 7/24/2021
The script was executed in JupyterLab, running in a Docker container separate from the standalone Spark cluster.
You are not using the proper option.
When you read the docs, you see this:
Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
This option is for the driver. That is why reading the schema works: it is an action done on the driver side. But when you run a Spark command, that command is executed by the workers (or executors), and they also need the .jar to access Postgres.
If your Postgres driver ("/opt/workspace/postgresql-42.2.18.jar") does not need any dependencies, then you can add it to the workers using spark.jars - I know MySQL does not require dependencies, for example, but I have never tried Postgres. If it does need dependencies, then it is better to pull the package directly from Maven using the spark.jars.packages option (see the linked doc for help).
You can also try adding:
.config("spark.executor.extraClassPath", "/opt/workspace/postgresql-42.2.18.jar")
so that the jar is included for your executors as well.
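For the spark.jars.packages route mentioned above, a rough sketch (written in Scala here; the same config key works the same way in the PySpark builder from the question, and the Maven coordinate below is an assumption matching the postgresql-42.2.18.jar version):
import org.apache.spark.sql.SparkSession

// Sketch: let Spark resolve the Postgres JDBC driver from Maven and ship it
// to the driver and the executors, instead of pointing at a local jar path.
val spark = SparkSession.builder()
  .master("spark://spark-master:7077")
  .appName("read-postgres-jdbc")
  .config("spark.jars.packages", "org.postgresql:postgresql:42.2.18") // assumed coordinate
  .getOrCreate()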

Write Spark DataFrame to PostgreSQL

I'm trying to write a Spark DataFrame to a pre-created PostgreSQL table. I get the following error during the INSERT step of my job:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO ref.tableA(a,b) VALUES ('Mike',548758) was aborted. Call getNextException to see the cause.
I also tried to catch the error and call the getNextException method, but I still have the same error in the logs. In order to write the DataFrame to the corresponding table, I used the following process:
val jdbcProps = new java.util.Properties()
jdbcProps.setProperty("driver", Config.psqlDriver)
jdbcProps.setProperty("user", Config.psqlUser)
jdbcProps.setProperty("password", Config.psqlPassword)
jdbcProps.setProperty("stringtype", "unspecified")
df.write
  .format("jdbc")
  .mode(SaveMode.Append)
  .jdbc(Config.psqlUrl, tableName, jdbcProps)
Package versions:
- Spark: 1.6.2
- Scala: 2.10.6
Any ideas?
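As a side note on the getNextException attempt described in the question, below is a hedged sketch of the kind of catch block that is usually meant (it reuses df, Config, tableName and jdbcProps from the snippet above); note that in a Spark job the failure often reaches the driver wrapped in a SparkException, which could explain why the same error still shows up in the logs:
import java.sql.{BatchUpdateException, SQLException}
import org.apache.spark.sql.SaveMode

try {
  df.write
    .mode(SaveMode.Append)
    .jdbc(Config.psqlUrl, tableName, jdbcProps)
} catch {
  case e: BatchUpdateException =>
    // walk the chain to reach the real PostgreSQL error message
    var next: SQLException = e.getNextException
    while (next != null) {
      println(next.getMessage)
      next = next.getNextException
    }
    throw e
}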

Connection to Cassandra from Spark error

I am using Spark 2.0.2 and Cassandra 3.11.2. I am using this code, but it gives me a connection error.
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val test = sc.cassandraTable("sensorkeyspace", "sensortable")
test.count
When I enter the test.count command, it gives me this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
Can you check the cassandra.yaml file? It seems that more concurrent connections may be open than are allowed at any instant of time.

PySpark: java.sql.SQLException: No suitable driver

I have Spark code which connects to Netezza and reads a table.
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df=hc.load(source="jdbc",url=address dbname";username=;password=",dbtable="")
I run the code with spark-submit in the following way:
spark-submit -jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong here? Is the jar not suitable, or is Spark not able to recognize the jar? If this is not the correct way, please let me know, and can anyone also provide a link to the jar for connecting to Netezza from Spark?
I am using Spark version 1.6.0.

HiveContext - unable to access an HBase table mapped in Hive as an external table

I am trying to access an HBase table mapped in Hive using HiveContext in Spark, but I am getting ClassNotFoundException exceptions. Below is my code.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("select * from dbn.hvehbasetable")
I am getting the error below:
17/06/22 07:17:30 ERROR log: error in initSerDe:
java.lang.ClassNotFoundException Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2120)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:342)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:337)
at scala.Option.map(Option.scala:145)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:337)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:332)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
Can anyone help with which class I need to import to read the HBase tables?
I think you need to add the hive-hbase-handler jar to the classpath/auxpath, if you haven't done that already.
Get your version from here.
Let me know if this helps. Cheers.
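For reference, a hedged sketch of how that jar could be passed when launching the shell (the path and version are placeholders, not taken from the question):
# placeholder path/version; --jars makes the jar available to the driver and executors
spark-shell --jars /path/to/hive-hbase-handler-<version>.jar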