Spark-Snowflake connection failure - scala

I'm trying to connect to Snowflake via Spark and read data, using the two approaches below to achieve the same result:
#Shell Env Variable
export COMM_LIBS=/home/user/libs
Approach 1 (failing to connect to Snowflake):
spark-shell --driver-memory 5g --num-executors 3 --executor-cores 4 --executor-memory 4g --deploy-mode client --queue queueName --jars ${COMM_LIBS}/snowflake-jdbc-3.13.6.jar,${COMM_LIBS}/spark-snowflake_2.11-2.9.1-spark_2.4.jar
import scala.collection.Map
val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
import net.snowflake.spark.snowflake._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf()
.setMaster("yarn").set("spark.executor.extraJavaOptions","-Djavax.net.ssl.trustStore=/certpath -Djavax.net.ssl.trustStorePassword=password")
.set("spark.driver.extraJavaOptions","-Djavax.net.ssl.trustStore=/certPath -Djavax.net.ssl.trustStorePassword=password")
val spark = new SparkSession.Builder().config(sparkConf).enableHiveSupport().getOrCreate()
val sfOptions = Map(
"sfURL" -> "URL",
"sfAccount" -> "ACCOUNT",
"sfUser" -> "USER",
"sfPassword" -> "PASSWORD",
"sfDatabase" -> "SF_DB",
"sfSchema" -> "SF_SCHEMA",
"sfWarehouse" -> "SF_WAREHOUSE",
"sfRole" -> "SF_ROLE",
"tracing" -> "all"
)
SnowflakeConnectorUtils.disablePushdownSession(spark)
spark.read.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("query", "select count(1) from table")
.option("autopushdown", "off")
.load().show()
Approach 2 (succeeding in connecting to Snowflake):
/opt/mapr/spark/spark-2.4.4/bin/spark-shell -v --driver-memory 4g --master yarn --deploy-mode client --num-executors 4 --executor-cores 3 --executor-memory 1g --queue queueName --conf "spark.yarn.executor.memoryOverhead=4096" --jars ${COMM_LIBS}/spark-snowflake_2.11-2.9.1-spark_2.4.jar,${COMM_LIBS}/snowflake-jdbc-3.13.6.jar --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/certpath -Djavax.net.ssl.trustStorePassword=password" --conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=/certpath -Djavax.net.ssl.trustStorePassword=password"
import scala.collection.Map
val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
import net.snowflake.spark.snowflake._
val sfOptions = Map(
"sfURL" -> "URL",
"sfAccount" -> "ACCOUNT",
"sfUser" -> "USER",
"sfPassword" -> "PASSWORD",
"sfDatabase" -> "SF_DB",
"sfSchema" -> "SF_SCHEMA",
"sfWarehouse" -> "SF_WAREHOUSE",
"sfRole" -> "SF_ROLE",
"tracing" -> "all"
)
SnowflakeConnectorUtils.disablePushdownSession(spark)
spark.read.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("query", "select count(1) from table")
.option("autopushdown", "off")
.load().show()
In this case, the Snowflake integration requires a truststore certificate, which I'm passing to the Spark session in two different ways. In the first approach the truststore configuration is set programmatically, and the connection fails with Caused by: sun.security.validator.ValidatorException: No trusted certificate found. In the second approach the same truststore is passed as configuration parameters on the spark-shell command line, and it returns the expected result. Since I'm supplying the truststore in both approaches, I don't understand why Spark detects the truststore certs in the second version but not in the first. Requesting the community to share some insights and help with this issue.
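A minimal diagnostic sketch for comparing the two sessions (assuming the driver JVM is what needs the truststore; the /certpath value is the placeholder from the question):
// Sketch: run inside each spark-shell session to check whether the driver JVM
// actually received the truststore flag.
println(System.getProperty("javax.net.ssl.trustStore"))   // expected: null in the failing shell, /certpath in the working one
println(spark.sparkContext.getConf.get("spark.driver.extraJavaOptions", "<not set>"))
For what it's worth, Spark's documentation notes that in client mode spark.driver.extraJavaOptions cannot be set through SparkConf inside the application, because the driver JVM (here, the spark-shell itself) has already started, and getOrCreate() inside spark-shell returns the shell's pre-existing session; that might explain why only the command-line variant picks up the certificates.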

Related

Spark Streaming Kafka Timeout

I am trying to run a simple Spark + Kafka integration example on Amazon EMR with spark-shell, but I keep getting timeout errors. However, when I publish with the org.apache.kafka client and the same settings as below, it works without failure.
Timeout error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
I moved client.truststore.jks and client.keystore.p12 to HDFS and ran the below:
$ spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0
import org.apache.spark.sql.functions.col
val kafkaOptions = Map("kafka.bootstrap.servers" -> s"$host:$port",
"kafka.security.protocol" -> "SSL",
"kafka.ssl.endpoint.identification.algorithm" -> "",
"kafka.ssl.truststore.location" -> "/home/hadoop/client.truststore.jks",
"kafka.ssl.truststore.password" -> "password",
"kafka.ssl.keystore.type" -> "PKCS12",
"kafka.ssl.key.password" -> "password",
"kafka.ssl.keystore.location" -> "/home/hadoop/client.keystore.p12",
"kafka.ssl.keystore.password" -> "password")
val df = spark
.read
.option("header", true)
.option("escape", "\"")
.csv("s3://bucket/file.csv")
val publishToKafkaDf = df.withColumn("value", col("body"))
publishToKafkaDf
.selectExpr( "CAST(value AS STRING)")
.write
.format("kafka")
.option("topic", "test-topic")
.options(kafkaOptions)
.save()
Solved: it was an AWS security group outbound-rules issue with the worker nodes.

Classnotfound error when connecting to snowflake from pyspark local machine

I am trying to connect to snowflake from Pyspark on my local machine.
My code looks as below.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkConf, SparkContext
sc = SparkContext("local", "sf_test")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('sf_test')
sfOptions = {
"sfURL" : "someaccount.some.address",
"sfAccount" : "someaccount",
"sfUser" : "someuser",
"sfPassword" : "somepassword",
"sfDatabase" : "somedb",
"sfSchema" : "someschema",
"sfWarehouse" : "somedw",
"sfRole" : "somerole",
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
I get an error when I run this particular chunk of code.
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query","""select * from
"PRED_ORDER_DEV"."SALES"."V_PosAnalysis" pos
ORDER BY pos."SAPAccountNumber", pos."SAPMaterialNumber" """).load()
Py4JJavaError: An error occurred while calling o115.load. :
java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
I have loaded the connector and JDBC jar files and added them to the CLASSPATH:
pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4
CLASSPATH = C:\Program Files\Java\jre1.8.0_241\bin;C:\snowflake_jar
I want to be able to connect to snowflake and read data with Pyspark. Any help would be much appreciated!
To run a PySpark application you can use spark-submit and pass the dependencies as Maven coordinates under the --packages option. I'm assuming you'd like to run in client mode, so you pass that to the --deploy-mode option, and finally you add the name of your PySpark program.
Something like below:
spark-submit --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4 --deploy-mode client spark-snowflake.py
Below is a working script.
You should create a jar directory in the root of your project and add two jars:
snowflake-jdbc-3.13.4.jar (JDBC driver)
spark-snowflake_2.12-2.9.0-spark_3.1.jar (Spark connector)
Next you need to find out your Scala compiler version. I'm using PyCharm, so double-tap Shift and search for 'scala'; you will see something like scala-compiler-2.12.10.jar. The first digits of the scala-compiler version (in our case 2.12) should match the first digits of the Spark connector (spark-snowflake_2.12-2.9.0-spark_3.1.jar).
Driver - https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/
Connector - https://docs.snowflake.com/en/user-guide/spark-connector-install.html#downloading-and-installing-the-connector
CHECK SCALA COMPILER VERSION BEFORE DOWNLOADING CONNECTOR
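If you prefer not to dig through the IDE's jars, a quick alternative check is to run:
spark-submit --version
which prints a banner including a line like "Using Scala version 2.12.10".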
from pyspark.sql import SparkSession
sfOptions = {
"sfURL": "sfURL",
"sfUser": "sfUser",
"sfPassword": "sfPassword",
"sfDatabase": "sfDatabase",
"sfSchema": "sfSchema",
"sfWarehouse": "sfWarehouse",
"sfRole": "sfRole",
}
spark = SparkSession.builder \
.master("local") \
.appName("snowflake-test") \
.config('spark.jars', 'jar/snowflake-jdbc-3.13.4.jar,jar/spark-snowflake_2.12-2.9.0-spark_3.1.jar') \
.getOrCreate()
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from some_table") \
.load()
df.show()

Issues in passing application configuration parameters to spark application

I created an object using Spark/Scala to load data from an Oracle source into a Hive table. The database password is passed through application.conf via typesafe ConfigFactory.
I tried placing application.conf in my user folder, and in another attempt on the classpath, with the spark-submit below.
On every attempt I encounter an error saying "java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':",
which suggests that the properties are not reaching the ConfigFactory methods.
Can someone help me with what I am missing?
//my object snippet
object LoadFromOracleToHive {
def SaveToHive(spark :SparkSession):Unit = {
try {
val appConf = ConfigFactory.load(s"application.conf").getConfig("my.config")
val sparkConfig = appConf.getConfig("spark") // config.getConfig("spark")
val df = spark
.read
.format("jdbc")
.options(Map("password" -> sparkConfig.getString("password") , "driver" -> "oracle.jdbc.driver.OracleDriver"))
//my application.conf
my.config {
spark {
password = "password"
}
}
//my spark-submit
spark-submit --class LoadFromOracleToHive \
--master yarn \
--deploy-mode client \
--driver-memory 4g \
--executor-memory 8g \
--num-executors 15 \
--executor-cores 5 \
--conf spark.kryoserializer.buffer.max=512m \
--queue csg \
--jars /home/myuserfolder/ojdbc7.jar /home/myuserfolder/SandeepTest-1.0-SNAPSHOT-jar-with-dependencies.jar \
--queue /home/myuserfolder/application.conf \
--conf spark.driver.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf \
--conf spark.executor.extraClassPath=-Dconfig.file=/home/myuserfolder/application.conf
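For comparison, a sketch of how the config file is typically wired up with Typesafe Config (paths taken from the question; the key points are that the application JAR has to be the last positional argument, the file is shipped with --files, and -Dconfig.file belongs in the driver's Java options rather than in extraClassPath):
spark-submit --class LoadFromOracleToHive \
--master yarn \
--deploy-mode client \
--queue csg \
--jars /home/myuserfolder/ojdbc7.jar \
--files /home/myuserfolder/application.conf \
--driver-java-options "-Dconfig.file=/home/myuserfolder/application.conf" \
/home/myuserfolder/SandeepTest-1.0-SNAPSHOT-jar-with-dependencies.jar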

Hive table in Spark

I am running the following job in HDP.
export SPARK_MAJOR_VERSION=2
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
--files /bigdata/datalake/app/config/metadata.csv BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227
It's failing with the error:
Table or view not found: `dl_raw`.`ACC`; line 1 pos 94;
'Aggregate [count(1) AS rec_cnt#58L, 'count('BRCH_NUM) AS hashcount#59, 'sum('ACC_NUM) AS hashsum#60]
+- 'Filter (('trim('country_code) = trim(TH)) && ('from_unixtime('unix_timestamp('substr('bus_date, 0, 11), MM/dd/yyyy), yyyyMMdd) = 20170227))
+- 'UnresolvedRelation `dl_raw`.`ACC`
whereas the table is present in Hive and is accessible from spark-shell.
UPD.
val sparkSession = SparkSession.builder
.appName("spark session example")
.enableHiveSupport()
.getOrCreate()
sparkSession.conf.set("spark.sql.crossJoin.enabled", "true")
val df_table_stats = sparkSession.sql("""select count(*) as rec_cnt,count(distinct BRCH_NUM) as hashcount,
sum(ACC_NUM) as hashsum
from dl_raw.ACC
where trim(country_code) = trim('BW')
and from_unixtime(unix_timestamp(substr(bus_date,0,11),'MM/dd/yyyy'),'yyyyMMdd')='20170227'""")
Include the hive-site.xml file in the --files parameter while submitting the job.
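For example (a sketch based on the command in the question; the hive-site.xml location is an assumption and depends on your cluster layout):
spark-submit --class com.spark.sparkexamples.Audit --master yarn --deploy-mode cluster \
--files /etc/hive/conf/hive-site.xml,/bigdata/datalake/app/config/metadata.csv \
BRNSAUDIT_v4.jar dl_raw.ACC /bigdatahdfs/landing/AUDIT/BW/2017/02/27/ACC_hash_total_and_count_20170227.dat TH 20170227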

Using pyspark to connect to PostgreSQL

I am trying to connect to a database with pyspark and I am using the following code:
sqlctx = SQLContext(sc)
df = sqlctx.load(
url = "jdbc:postgresql://[hostname]/[database]",
dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
password = "MichaelJordan",
user = "ScottyPippen",
source = "jdbc",
driver = "org.postgresql.Driver"
)
and I am getting an error.
Any idea why this is happening?
Edit: I am trying to run the code locally in my computer.
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/
Then replace the database configuration values with your own.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/databasename") \
.option("dbtable", "tablename") \
.option("user", "username") \
.option("password", "password") \
.option("driver", "org.postgresql.Driver") \
.load()
df.printSchema()
More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the python script load the tables as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
options(url='jdbc:%s' % url, dbtable='tablename').\
load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in the pyspark console and in Jupyter.
You normally need either:
to install the Postgres Driver on your cluster,
to provide the Postgres driver jar from your client with the --jars option
or to provide the Maven coordinates of the Postgres driver with the --packages option.
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
One approach, building on the example per the quick start guide, is this blog post which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
--packages org.postgresql:postgresql:9.4.1211\
--driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
--master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
database='my_db',
dbtable='my_table'
).load()
dataframe.show()
To use pyspark with a Jupyter notebook, first open pyspark with:
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then in jupyter notebook
import os
jardrv = "~/spark_drivers/postgresql-42.2.12.jar"
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
I had trouble getting a connection to the Postgres DB with the jars I had on my computer.
This code solved my problem with the driver:
from pyspark.sql import SparkSession
import os
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'
spark = SparkSession \
.builder \
.config("spark.driver.extraClassPath", sparkClassPath) \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
.option("driver", "org.postgresql.Driver") \
.option("dbtable", "yourtablename") \
.option("user", "postgres") \
.option("password", "***") \
.load()
df.show()
I also got this error:
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(Unknown Source)
Adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession builder made it work.
e.g.:
from pyspark import SparkContext, SparkConf
import os
from pyspark.sql.session import SparkSession
spark = SparkSession \
.builder \
.appName('Python Spark Postgresql') \
.config("spark.jars", "./postgresql-42.2.18.jar") \
.config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
.getOrCreate()
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost:5432/abc") \
.option("dbtable", 'tablename') \
.option("user", "postgres") \
.option("password", "1") \
.load()
df.printSchema()
This exception means the JDBC driver is not on the driver classpath.
You can pass JDBC jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
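For example (a sketch; the jar path and script name are placeholders):
spark-submit --jars /path/to/postgresql-42.2.18.jar \
--conf spark.driver.extraClassPath=/path/to/postgresql-42.2.18.jar \
script.py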
Download postgresql jar from here:
Add this to ~Spark/jars/ folder.
Restart your kernel.
It should work.
Just initialize pyspark with --jars <path/to/your/jdbc.jar>
E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar
then create a dataframe as suggested above in other answers
E.g.:
df2 = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/db").option("dbtable", "yourTableHere").option("user", "postgres").option("password", "postgres").option("driver", "org.postgresql.Driver").load()
Download postgres JDBC driver from https://jdbc.postgresql.org/download.html
and use the script below.
Changes to make:
Edit PATH_TO_JAR_FILE
Save your DB credentials in an environment file and load them
Query the DB using the query option, and control the number of rows fetched per round trip with the fetchsize option
import os
from pyspark.sql import SparkSession
PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"
spark = SparkSession \
.builder \
.appName("Example") \
.config("spark.jars", PATH_TO_JAR_FILE) \
.getOrCreate()
DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")
df = spark.read \
.format("jdbc") \
.option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
.option("user", DB_USER) \
.option("password", DB_PASSWORD) \
.option("driver", "org.postgresql.Driver") \
.option("query","select * from your_table") \
.option('fetchsize',"1000") \
.load()
df.printSchema()