Spark-Snowflake Connection Errors - scala

Here are the versions I am using:
Spark - 3.0.1
Scala - 2.12.13
Python - 3.7.6
I am having issues running the below code. This is the basic connection to Snowflake via PySpark.
Here is my code:
# Spark imports
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
#
spark = SparkSession \
.builder \
.appName("Pyspark-Snowflake") \
.config('spark.jars','/Users/hana/spark-sf/snowflake-jdbc-3.12.1.jar,/Users/hana/spark-sf/spark-snowflake_2.11-2.8.1-spark_2.4.jar') \
.getOrCreate()
# Set options below
sfOptions = {
"sfURL" : "XXX",
"sfUser" : "XXX",
"sfPassword" : "XXX",
"sfRole": "XXX",
"sfDatabase" : "XXX",
"sfSchema" : "XXX",
"sfWarehouse" : "XXX"
}
# Set Snowflake source
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
# Read from Snowflake
#import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from TABLE limit 100") \
.load()
df.show()
And here is the error I am getting (in Spyder):
Py4JJavaError: An error occurred while calling o40.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at net.snowflake.spark.snowflake.Parameters$MergedParameters.<init>(Parameters.scala:294)
at net.snowflake.spark.snowflake.Parameters$.mergeParameters(Parameters.scala:288)
at net.snowflake.spark.snowflake.DefaultSource.createRelation(DefaultSource.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:221)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 20 more
What is wrong in my code / versions? I've tried multiple JDBC versions and continue to get errors. Thank you in advance!

I can see from your spark.jars config that you are using the Spark-Snowflake connector built for Spark 2.4 and Scala 2.11 (spark-snowflake_2.11-2.8.1-spark_2.4.jar), while you are running Spark 3.0.1 with Scala 2.12. That mismatch is what produces NoClassDefFoundError: scala/Product$class. Either re-run with Spark 2.4 installed:
pip install pyspark==2.4.4
Or use the connector jar that is built for Spark 3.0.
The naming convention that tells you which jar to download can be found here: https://docs.snowflake.com/en/user-guide/spark-connector-install.html

It seems like you are using an incorrect version of the spark-snowflake jar.
The name of the spark-snowflake jar encodes exactly what it supports.
For example: spark-snowflake_2.11-2.8.1-spark_2.4.jar
This jar supports Spark version 2.4 and Scala version 2.11.
Please check the Spark and Scala versions on your system and download the matching spark-snowflake jar from the Maven repository.
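The naming convention described above can be checked mechanically before starting Spark. The following is a minimal sketch, assuming jar names follow the pattern spark-snowflake_<scala>-<connector>-spark_<spark>.jar seen in the question. The names ConnectorJar, parse_connector_jar and is_compatible are hypothetical helpers, not part of any Snowflake or Spark API.

```python
import re
from typing import NamedTuple, Optional


class ConnectorJar(NamedTuple):
    scala: str      # Scala binary version the jar was built for, e.g. "2.11"
    connector: str  # connector release, e.g. "2.8.1"
    spark: str      # Spark minor version the jar targets, e.g. "2.4"


# Assumed pattern: spark-snowflake_<scala>-<connector>-spark_<spark>.jar
_JAR_RE = re.compile(
    r"spark-snowflake_(?P<scala>\d+\.\d+)"
    r"-(?P<connector>[\w.]+)"
    r"-spark_(?P<spark>\d+\.\d+)\.jar$"
)


def parse_connector_jar(path: str) -> Optional[ConnectorJar]:
    """Extract the versions encoded in a connector jar name, or None."""
    match = _JAR_RE.search(path)
    return ConnectorJar(**match.groupdict()) if match else None


def _major_minor(version: str) -> str:
    """Reduce '3.0.1' to '3.0' so it can be compared with the jar name."""
    return ".".join(version.split(".")[:2])


def is_compatible(jar_path: str, spark_version: str, scala_version: str) -> bool:
    """True if the jar targets the given Spark and Scala versions."""
    jar = parse_connector_jar(jar_path)
    if jar is None:
        return False
    return (jar.spark == _major_minor(spark_version)
            and jar.scala == _major_minor(scala_version))


# The jar and versions from the question do not match each other.
jar = "/Users/hana/spark-sf/spark-snowflake_2.11-2.8.1-spark_2.4.jar"
print(parse_connector_jar(jar))
print(is_compatible(jar, "3.0.1", "2.12.13"))
```

With the versions from the question (Spark 3.0.1, Scala 2.12.13) the check fails on both fields, which matches the diagnosis in the answers.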

Related

I couldn't do spark-submit using mobaxterm on Apache Ambari using intellij, are there any suggestions?

I wrote the code in IntelliJ and built a jar file. Then I used MobaXterm to upload the jar file to the Apache Ambari files under /user/maria_dev. Finally I ran the spark-submit command in MobaXterm as "spark-submit --class csv_parquet /user/maria_dev/first_2.12-0.1.jar yarn /user/maria_dev/fileupload/business-operations-survey-2020-covid-19-csv.csv /user/maria_dev/output" and I'm getting the following error:
SPARK_MAJOR_VERSION is set to 2, using Spark2
Warning: Local jar /user/maria_dev/first_2.12-0.1.jar does not exist, skipping.
java.lang.ClassNotFoundException: csv_parquet
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:239)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:861)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/06/28 19:51:42 INFO ShutdownHookManager: Shutdown hook called
21/06/28 19:51:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-02f3e56f-af15-46ab-b43a-54d17d88f5cc
and my intellij code is:
import org.apache.spark.sql.SparkSession
object csv_parquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv_parquet").master("yarn").getOrCreate()
    val df = spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", "true")
      .option("path", "C:\\Users\\kiran\\business-operations-survey-2020-covid-19-csv.csv").load()
    df.show(20)
    df.write.mode("overwrite")
      .parquet("\\user\\maria_dev\\output")
    val par = spark.read.format("parquet").load("\\user\\maria_dev\\output")
    par.show()
    spark.close()
  }
}
I'm getting the output and was able to spark-submit locally but unable to do it on apache ambari.

Load data from redshift using spark and scala in an EMR

I am trying to connect to Redshift using Spark with Scala in Zeppelin on an EMR cluster. I used the spark-redshift library but it doesn't work. I have tried many solutions and I don't know why it gives this error:
val df = spark.read.format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://xx:xx/xxxx?user=xxx&password=xxx")
  .option("tempdir", path)
  .option("query", sql_query)
  .load()
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
... 51 elided
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 53 more
Should I import something first, or maybe change some configuration?
To use a specific module on EMR you must add it to your cluster; it is not there automatically.
Your error says that Spark cannot find the module.
Take a look at:
https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/
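The "Caused by" line in the trace hints at how Spark resolves the string passed to format(): it looked for com.databricks.spark.redshift.DefaultSource and found nothing on the classpath. Below is a rough sketch of that lookup, written in plain Python for illustration only. The real logic lives in Spark's Scala code (DataSource.lookupDataSource) and also consults registered short names; candidate_classes is a hypothetical helper.

```python
def candidate_classes(provider):
    """Class names Spark tries to load for a fully qualified format() name.

    Simplified: Spark also checks short names registered by data sources.
    """
    return [provider, provider + ".DefaultSource"]


for name in candidate_classes("com.databricks.spark.redshift"):
    print(name)
```

If neither class is in a jar visible to the driver, the lookup fails with the error shown, so the fix is to make the connector jar available to the cluster rather than to add an import.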

How to specify datasource in spark.read.format when using the data direct jdbc driver of Greenplum (greenplum.jar) to read a greenplum table?

I am trying to read data from a table on Greenplum using spark. I wrote the code as below:
val yearDF = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider").option("url", connectionUrl)
.option("server.port","5432")
.option("dbtable", "tablename")
.option("dbschema","schemaname")
.option("user", devUserName)
.option("password", devPassword)
.option("partitionColumn","employeeLoc")
.option("partitions",450)
.load()
.where("period_year=2017 and period_num=12")
.select(gpColSeq map col:_*)
.withColumn(flagCol, lit(0))
I am using greenplum.jar, which provides the DataDirect JDBC driver, to read data from a Greenplum table using Spark.
I am using the below spark-submit command:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.YearPartition --master=yarn --conf spark.ui.port=4090 --driver-class-path /home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar --executor-cores 3 --executor-memory 13G --keytab /home/hdpuser/hdpuser.keytab --principal hdpuser#devuser.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Splinter --conf spark.executor.extraClassPath=/home/hdpuser/jars/greenplum.jar,/home/hdpuser/jars/postgresql-42.1.4.jar splinter_2.11-0.1.jar
When I submit the jar, I see the exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: io.pivotal.greenplum.spark.GreenplumRelationProvider. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:553)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:89)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:304)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at com.partition.source.YearPartition$.prepareFinalDF$1(YearPartition.scala:154)
at com.partition.source.YearPartition$.main(YearPartition.scala:181)
at com.partition.source.YearPartition.main(YearPartition.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: io.pivotal.greenplum.spark.GreenplumRelationProvider.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:537)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:537)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:537)
I understood that this is due to using io.pivotal.greenplum.spark.GreenplumRelationProvider in the datasource statement i.e.
spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
I tried "io.pivotal.greenplum.spark.GreenplumRelationProvider" & "greenplum" but both result in the same exception which is:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: io.pivotal.greenplum.spark.GreenplumRelationProvider. Please find packages at http://spark.apache.org/third-party-projects.html
I can't work out what I should give as the data source in spark.read.format when using the DataDirect JDBC jar, greenplum.jar.
Could anyone let me know how I can fix this problem?
What version of the greenplum-spark connector are you using?
You should be able to specify the custom JDBC driver in the "driver" option. Refer to http://greenplum-spark.docs.pivotal.io/160/using_the_connector.html#use_custom_jdbcdriver.
You can specify the data source as follows:
spark.read.format("greenplum")

Connecting SQLserver jdbc driver to Dataproc cluster

I am working on a PySpark application that analyzes aviation data. The database is an MS SQL Server DB. While connecting to the database on the server, I get a "No suitable driver" error. However, when I run on my local machine from the CLI and add the JDBC driver jar file to driver-class-path, it runs and connects to the DB. When I try to run on a Dataproc cluster, it throws "No suitable driver".
The code snippet is as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *
# Parentheses let the chained calls span several lines
spark = (SparkSession.builder
         .appName('Test')
         .getOrCreate())
df = (spark.read.format("jdbc").options(
          url="jdbc:sqlserver:XYXYXY",
          database="data1",
          user="YYYY", password="XXXX",
          dbtable="db")
      .load())
The Error was:
Py4JJavaError: An error occurred while calling o209.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there another way to add JDBC jar files to the Dataproc cluster?
A very similar question and its answer show how to add a JDBC driver to the Spark driver classpath using the gcloud command:
$ gcloud dataproc jobs submit spark ... \
--jars=gs://<BUCKET>/<DIRECTORIES>/<JAR_NAME> \
--properties=spark.driver.extraClassPath=<JAR_NAME>
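Alongside shipping the jar, it can help to name the driver class explicitly. The trace in the question shows Spark calling java.sql.DriverManager.getDriver to guess the driver from the URL, and setting the "driver" option avoids that guess. This is a minimal sketch, assuming the Microsoft JDBC driver jar is the one passed via --jars. The helper sqlserver_jdbc_options is hypothetical, and the host, port and credentials are placeholders.

```python
def sqlserver_jdbc_options(host, port, database, user, password, table):
    """Build the options dict for spark.read.format("jdbc") on SQL Server."""
    return {
        # Standard SQL Server JDBC URL layout
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        # Driver class shipped in Microsoft's JDBC driver jar
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "user": user,
        "password": password,
        "dbtable": table,
    }


# Placeholder values; replace with the real server details.
opts = sqlserver_jdbc_options("dbhost", 1433, "data1", "YYYY", "XXXX", "db")

# With a live SparkSession named `spark`, the read would look like:
# df = spark.read.format("jdbc").options(**opts).load()
```

The jar still has to be on the driver classpath; the "driver" option only tells Spark which class to load from it.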

Connecting from Spark/pyspark to PostgreSQL

I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting the basics seem to work:
import os
os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"  # raw string keeps the backslashes literal
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()
sc.stop()
This runs as expected. I now want to connect to a Postgres 9.3 database running on the same server. I have downloaded the JDBC driver from here and put it in the folder D:\Analytics\Spark\spark_jars. I then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:
spark.driver.extraClassPath 'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'
I've run the following code to test the connection:
import os
os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"  # raw string keeps the backslashes literal
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
.load(source="jdbc",
url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
dbtable="pubs")
)
sc.stop()
But am getting the following error:
Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
How can I check whether I've downloaded the right .jar file or where else the error might come from?
I tried the SPARK_CLASSPATH environment variable but it doesn't work with Spark 1.6.
Answers to other posts, like the ones below, suggest adding arguments to the pyspark command, and that works:
Not able to connect to postgres using jdbc in pyspark shell
Apache Spark : JDBC connection not working
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
Remove spark-defaults.conf and add the SPARK_CLASSPATH to the system environment in python like this:
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'
Another way to connect pyspark to your PostgreSQL db:
1) Install Spark with pip: pip install pyspark
2) Download the latest version of the PostgreSQL JDBC connector from:
https://jdbc.postgresql.org/download.html
3) Complete this code with your db credentials:
from __future__ import print_function
from pyspark.sql import SparkSession
def jdbc_dataset_example(spark):
    # Read the table over JDBC, then query it through a temp view
    df = spark.read \
        .jdbc("jdbc:postgresql://[your_db_host]:[your_db_port]/[your_db_name]",
              "com_dim_city",
              properties={"user": "[your_user]", "password": "[your_password]"})
    df.createOrReplaceTempView("[your_table]")
    sqlDF = spark.sql("SELECT * FROM [your_table] LIMIT 10")
    sqlDF.show()
if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL data source example") \
        .getOrCreate()
    jdbc_dataset_example(spark)
    spark.stop()
Finally, run your application with:
spark-submit --driver-class-path /path/to/your_jdbc_jar/postgresql-42.2.6.jar --jars postgresql-42.2.6.jar /path/to/your_jdbc_jar/test_pyspark_to_postgresql.py
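The flags used in the commands in this thread follow two different list formats, which is easy to get wrong when more than one jar is involved: --jars takes a comma-separated list, while --driver-class-path and spark.executor.extraClassPath are ordinary Java classpaths joined with the platform path separator. Here is a small sketch that assembles the arguments. The helper jdbc_submit_args is hypothetical and the paths are placeholders.

```python
import os


def jdbc_submit_args(jars, script, classpath_sep=os.pathsep):
    """Assemble spark-submit arguments that expose JDBC jars to Spark.

    jars: list of jar paths; script: the PySpark script to run.
    classpath_sep: ':' on Linux/macOS, ';' on Windows (os.pathsep by default).
    """
    classpath = classpath_sep.join(jars)
    return [
        "spark-submit",
        "--driver-class-path", classpath,   # Java classpath for the driver
        "--jars", ",".join(jars),           # comma-separated list of jars
        "--conf", f"spark.executor.extraClassPath={classpath}",
        script,
    ]


args = jdbc_submit_args(
    ["/path/to/your_jdbc_jar/postgresql-42.2.6.jar"],
    "/path/to/your_jdbc_jar/test_pyspark_to_postgresql.py",
    classpath_sep=":",
)
print(" ".join(args))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.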