Connecting to AWS Redshift with Zeppelin Spark 2.0 and Pyspark - postgresql

I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was summer of 2016.
I haven't had need of it since then and now AWS has Spark 2.0.
The new syntax is myDF = spark.read.jdbc(...), which I used like this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", dbtable=aquery).load()
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc. :
java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
(, Py4JJavaError(u'An error occurred while calling o119.jdbc.\n', JavaObject id=o121), )
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I don't know how to implement this, so I did more reading in various blog posts and Stack Overflow posts and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this on the Interpreter settings page of Zeppelin, but I still get the same error.
I also tried to add a Postgres interpreter, and I'm not sure I did it right (I wasn't sure whether to base it on the Spark interpreter or the Python interpreter), so I chose the Spark interpreter. Now the Postgres interpreter has all the same settings as the Spark interpreter, which might not matter, but I still get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered, but their information looks like code samples rather than instructions for setting up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!

You need to use Amazon's Redshift-specific JDBC driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin
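For example, once that jar is on the Spark interpreter's classpath (added as a dependency/artifact in the interpreter settings, or via spark.jars), a read along these lines should work. This is only a sketch: the com.amazon.redshift.jdbc41.Driver class name and the jdbc:redshift:// URL scheme are assumptions based on the bundled RedshiftJDBC41.jar, and user/password are just the placeholders from the original snippet:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time FROM schema.table WHERE serial_number IN ('abcdefg','1234567') GROUP BY serial_number) as minDates"
# read through the EMR-bundled Redshift JDBC driver instead of the Postgres one
dfMinDates = (spark.read.format("jdbc")
    .option("driver", "com.amazon.redshift.jdbc41.Driver")
    .option("url", "jdbc:redshift://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true")
    .option("user", "user")
    .option("password", "password")
    .option("dbtable", aquery)
    .load())
dfMinDates.show()
Note also that spark.driver.extraClassPath expects a path to a jar file (such as the one above), not a class name like org.postgresql.Driver.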

Related

Pyspark in Azure - need to configure sparkContext

Using a Spark notebook in Azure Synapse, I'm processing some data from parquet files and outputting it as different parquet files. I produced a working script and started applying it to different datasets, all working fine until I came across a dataset containing dates older than 1900.
For this issue, I came across this article (which I took to be applicable to my scenario):
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
The suggested fix is to add this code chunk to the top of my notebook, which I did:
%%pyspark
from pyspark import SparkContext
sc = SparkContext()
# Get current sparkconf which is set by glue
conf = sc.getConf()
# add additional spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart spark context
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# create glue context with the restarted sc
glueContext = GlueContext(sc)
Unfortunately this generated another error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. :
java.lang.IllegalStateException: Promise already completed.
    at scala.concurrent.Promise.complete(Promise.scala:53)
    at scala.concurrent.Promise.complete$(Promise.scala:52)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
    at scala.concurrent.Promise.success(Promise.scala:86)
    at scala.concurrent.Promise.success$(Promise.scala:86)
    at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$sparkContextInitialized(ApplicationMaster.scala:408)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:910)
    at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:683)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
I've tried looking into resolutions, but this is getting outside my area of expertise. I want my Synapse Spark notebook to run, even on date fields where the date is earlier than 1900. Any ideas?
I was able to solve this problem by changing the overall configuration for my spark pool (which you will probably want to do as well, unless you want to add config code to every notebook you make). To do this, open up Synapse Studio, then go Manage > Apache Spark pools, click the three dots by your pool (which will be hidden until you mouse over them, great design Microsoft), then select Apache Spark configuration.
From there, create a new configuration and add a configuration property. For the property, enter spark.sql.parquet.int96RebaseModeInRead, and for the value enter CORRECTED. Note that spark.sql.parquet.int96RebaseModeInRead does NOT show up as a suggested property; you have to enter it yourself.
Apply your changes, save everything, and make sure your new configuration is selected. It might take a bit for the new changes to be reflected in your notebooks, but it should work from there. If you notice some funky date issues with older dates, try changing CORRECTED to LEGACY.
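If you would rather not touch the pool configuration, a lighter-weight sketch (an assumption on my part, not part of the accepted fix) is to set the same options on the session Synapse already gives you, instead of stopping and re-creating a SparkContext; re-creating the context is what appears to trigger the "Promise already completed" error on YARN:
%%pyspark
# set the rebase modes on the existing Synapse session; no new SparkContext needed
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")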

HiveException when running a sql example in Spark shell

A newbie in Apache Spark here! I am using Spark 2.4.0 and Scala version 2.11.12, and I'm trying to run the following code in my spark shell -
import org.apache.spark.sql.SparkSession
import spark.implicits._
var df = spark.read.json("storesales.json")
df.createOrReplaceTempView("storesales")
spark.sql("SELECT * FROM storesales")
And I get the following error -
2018-12-18 07:05:03 WARN Hive:168 - Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
    at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
    at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I also saw this: Issues trying out example in Spark-shell. As per the accepted answer, I tried to start my spark shell like so:
~/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --conf spark.sql.warehouse.dir=file:///tmp/spark-warehouse
However, it did not help and the issue persists.

scala spark raises an error related to derby everytime when doing toDF() or createDataFrame

I am new to Scala and the Scala API of Spark, and I recently tried the Scala API on my own computer, which means I run Spark locally by setting SparkSession.builder().master("local[*]"). At first I succeeded in reading a text file using spark.sparkContext.textFile(). After getting the corresponding RDD, I tried to convert the RDD to a Spark DataFrame, but failed again and again.
To be specific, I tried two methods: 1) toDF() and 2) spark.createDataFrame(). Both failed, and both gave me a similar error, shown below.
2018-10-16 21:14:27 ERROR Schema:125 - Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@199549a5, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
I examined the error message; it seems the errors are related to Apache Derby and that a connection to some database failed. I do not actually know what JDBC is. I am somewhat familiar with pyspark and I have never been asked to configure any JDBC database, so why does Scala-API Spark need it? What should I do to avoid this error? And why does a Scala-API Spark DataFrame need JDBC or any database while a Scala-API Spark RDD doesn't?
For future googlers:
I googled for several hours and still had no idea how to get rid of this error, but the origin of the problem is very clear: my SparkSession enables support for Hive, which then needs a database for its metastore. To solve this, we need to disable the Hive support; since I am running Spark on my own Mac, it is OK to do this.
So I downloaded the Spark source and built it myself using the command
./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests
which omits -Phive and -Phive-thriftserver.
I tested the self-built Spark, and the metastore_db folder has never been created; so far so good.
For the details, please refer to this post: Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

PySpark: java.sql.SQLException: No suitable driver

I have spark code which connects to Netezza and reads a table.
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df=hc.load(source="jdbc",url=address dbname";username=;password=",dbtable="")
I run the code with spark-submit in the following way:
spark-submit -jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong here? Is the jar not suitable, or is Spark not able to recognize the jar? Please let me know the correct way if this is not it, and can anyone provide a link to get the jar for connecting to Netezza from Spark?
I am using the 1.6.0 version of spark.
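For reference, the pattern that usually clears "No suitable driver" in this setup is to pass the jar to both --jars and --driver-class-path on spark-submit, and to name the driver class explicitly in the reader. The sketch below is an assumption, not a confirmed fix: org.netezza.Driver, the port, and the placeholder host/table/user values are mine, so check them against your nzjdbc.jar.
# hypothetical sketch for Spark 1.6 with the HiveContext `hc` from above; values in <> are placeholders
nz_df = (hc.read.format("jdbc")
    .option("url", "jdbc:netezza://<host>:5480/<dbname>")
    .option("driver", "org.netezza.Driver")   # assumed driver class inside nzjdbc.jar
    .option("dbtable", "<table>")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())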

Amazon EMR w/ Spark w/ Postgres: "Failed to start database 'metastore_db'"

I've used Apache Spark with the PostgreSQL JDBC driver on my own Linux servers before without issues, but I can't get it to work on Amazon EMR doing it the same way.
I first downloaded the Postgres driver and set up my pyspark classpath this way: Adding postgresql jar though spark-submit on amazon EMR
I executed the following in pyspark on an Amazon EMR instance set up with Spark, similarly to how I usually do it on my own server. "myhost" is the hostname of my Amazon RDS instance running PostgreSQL, which I am able to connect to from my EMR instance with psql, so I know it should work:
# helper, gets RDD from database
def get_db_rdd(table, lower=0, upper=1000):
    db_connection = {
        "host": "myhost",
        "port": 5432,
        "database": "mydb",
        "user": "postgres",
        "password": "mypassword"
    }
    url = "jdbc:postgresql://{}:{}/{}?user={}".format(db_connection["host"],
                                                      db_connection["port"],
                                                      db_connection["database"],
                                                      db_connection["user"])
    ret = sqlContext \
        .read \
        .format("jdbc") \
        .option("url", url) \
        .option("dbtable", table) \
        .option("partitionColumn", "id") \
        .option("numPartitions", 1024) \
        .option("lowerBound", lower) \
        .option("upperBound", upper) \
        .option("password", db_connection["password"]) \
        .load()
    ret = ret.rdd
    return ret

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

rdd = get_db_rdd("test", 0, 3)  # table exists, has columns (`id bigserial, string text`)
I immediately get a crash with this exception:
17/04/21 19:34:07 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3aa157b0, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[...]
Looking around online, this has to do with Apache Hive... No idea why that's involved here, but I may be misunderstanding. I do see metastore_db in my home dir. All the proposed solutions involve editing some Hive configuration that I don't even have on my instance, or creating that directory, which I already have. My EMR instance has totally default settings. Could someone more familiar with this environment point me in the right direction?
Edit: I don't have the entire stack trace handy but have some left in my GNU screen. Here's more, mentions Derby:
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@3aa157b0, see the next exception for details.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
... 113 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/hadoop/metastore_db.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
Edit 2: Using other RDDs like the following works: sc.parallelize([1, 2, 3]).map(lambda r: r * 2).collect(). The problem is only for RDDs connected to Postgres.
Edit 3:
>>> spark.range(5).show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
The error message:
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/hadoop/metastore_db.
tells us that the embedded, single-threaded Derby instance is already in use. I'm not very familiar with Hive, but it is used when Spark boots up a Hive-enabled SparkSession, as you can see in your stack trace:
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:192)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:605)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:516)
I copied the most relevant lines (to remove the noise).
Side note: You don't really need Hive features these days, since Spark supports most of them natively (and in Spark 2.2 most of the Hive "infrastructure" will go away).
As you can see in the stack trace, the multiple-threads-accessing-single-threaded-Derby exception will only be thrown when you use SparkSession, which is the entry point to Spark SQL.
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:605)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:516)
That's why you don't see it when working with the RDD API; the RDD API does not use Hive at all.
Read up on Hive's official documentation at Local/Embedded Metastore Database (Derby).
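A minimal sketch of that side note, assuming you run a standalone pyspark script (not the pyspark shell, where a session already exists) and don't need Hive tables at all: ask for the in-memory catalog and the Derby-backed metastore_db is never touched.
from pyspark.sql import SparkSession

# build a session without Hive support, so no Derby metastore_db lock is taken
spark = (SparkSession.builder
    .appName("postgres-read")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate())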
Thanks to suggestions from Jacek about the nature of my problem, I developed a hack workaround after some trial and error. Haven't been able to really solve the problem yet, but this works, and that's good enough for me. I'll report back if I run into problems later.
Start pyspark with the Postgres driver as normal: pyspark --driver-class-path=/home/hadoop/postgres_driver.jar --jars=/home/hadoop/postgres_driver.jar
While that's open (!), in a separate SSH session, cd to home and mv metastore_db old_metastore_db (or you can do this in pyspark with os.system()). The point of this is to release the lock on the metastore that Spark creates by default; Spark will recreate the directory without a lock.
Try to create an RDD connected to Postgres the way I described in my question. It'll give an error about "no suitable driver." For some reason, the driver wasn't loaded. But after that error, it seems the driver is actually loaded now.
mv metastore_db old_metastore_db2, for similar reasons to above. I guess another Hive session is connected now and needs its lock to be cleared out.
Create the Postgres-connected RDD again, same way. Driver is loaded, and metastore is unlocked, and it seems to work. I can fetch from my tables, perform RDD operations, and collect().
Yes, I know this is very dirty. Use at your own risk.
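A less invasive variant worth trying first (an assumption on my part, reusing the names from the question's get_db_rdd helper and keeping the jar on --jars/--driver-class-path) is to name the driver class explicitly in the reader, so DriverManager never has to discover it:
# inside get_db_rdd, name the driver class so the "No suitable driver" lookup is skipped
ret = (sqlContext.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", url)
    .option("dbtable", table)
    .option("user", db_connection["user"])
    .option("password", db_connection["password"])
    .load())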