I am trying to connect to Snowflake with PySpark on my local machine.
My code is as follows:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.jars','/path/to/driver/snowflake-jdbc-3.12.17.jar , \
/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')
spark = SparkSession.builder \
.master("local") \
.appName("snowflake-test") \
.config(conf=conf) \
.getOrCreate()
sfOptions = {
"sfURL": "https://someurl.com",
"sfAccount": "account",
"sfUser": "user",
"sfPassword": "password",
"sfDatabase": "database",
"sfSchema": "PUBLIC",
"sfWarehouse": "warehouse"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select * from DimDate") \
.load()
df.show()
When I run this I get the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
How can I fix this?
The Snowflake Spark connector version "spark-snowflake_2.12:2.10.0-spark_3.2" requires
Snowflake JDBC 3.13.14, but you are using JDBC version 3.12.17.
Can you switch to JDBC version 3.13.14 and test again? As FKyani pointed out, this is a compatibility issue between the Snowflake Spark connector JAR and the JDBC JAR.
Please confirm that the correct JDBC version is imported.
Recommended client versions: https://docs.snowflake.com/en/release-notes/requirements.html#recommended-client-versions
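A minimal sketch of the version swap described above, assuming the two JAR paths from the question with only the JDBC driver changed to 3.13.14. The helper name `jars_conf_value` is hypothetical and the paths are placeholders; the function only builds the comma-separated string that `spark.jars` takes.

```python
def jars_conf_value(jar_paths):
    """Join JAR paths into one comma-separated string with no stray whitespace."""
    cleaned = [path.strip() for path in jar_paths if path.strip()]
    return ",".join(cleaned)


# Placeholder paths taken from the question; only the JDBC version differs.
jars = jars_conf_value([
    "/path/to/driver/snowflake-jdbc-3.13.14.jar",
    "/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar",
])
print(jars)

# Possible usage with the SparkConf from the question:
# conf.set("spark.jars", jars)
```

If the machine can reach Maven Central, setting `spark.jars.packages` to `net.snowflake:snowflake-jdbc:3.13.14,net.snowflake:spark-snowflake_2.12:2.10.0-spark_3.2` instead of pointing at local files may be a simpler way to keep the pair in sync.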
This looks similar to the error mentioned in this article:
https://community.snowflake.com/s/article/Error-py4j-protocol-Py4JJavaError-An-error-occurred-while-calling-o74-load-java-lang-NoSUchMethodError-scala-Product-init-Lscala-Product-V
If you are using Scala 2.12, you need to downgrade to 2.11. Note that in this case you will also have to use the matching version of the Spark connector for Snowflake.
Spark connectors for Snowflake can be found here. We recommend using the latest connector version for your Spark version and Scala 2.11.
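Because this advice depends on which Scala and Spark versions a connector JAR was built for, here is a small sketch that reads them from the file name. It assumes the naming pattern seen in the question (`spark-snowflake_<scala>-<connector>-spark_<spark>.jar`); the function name `connector_versions` is made up for illustration.

```python
import re

# Assumed file-name pattern, based on the JAR name in the question.
_CONNECTOR_PATTERN = re.compile(
    r"spark-snowflake_(?P<scala>\d+\.\d+)"
    r"-(?P<connector>\d+\.\d+\.\d+)"
    r"-spark_(?P<spark>\d+\.\d+)\.jar$"
)


def connector_versions(jar_name):
    """Return the Scala, connector, and Spark versions encoded in the name, or None."""
    match = _CONNECTOR_PATTERN.search(jar_name)
    return match.groupdict() if match else None


info = connector_versions(
    "/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar"
)
print(info)
```

The `spark` value can then be compared with `spark.version` from a running session, and the `scala` value with the Scala version your Spark distribution was built against.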
Spark Version: 3.2.1
Delta version: 1.2.1 (tried 2.0 version as well)
I am trying to run the getting-started code to try out Delta:
from pyspark.sql import SparkSession
from delta import *
builder = SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
I am getting the error below:
"name": "Py4JJavaError",
"message": "An error occurred while calling o201.showString.\n: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog'
Can anyone please help me understand the issue and resolve it?
Thanks in advance.
I'm not sure which environment and mode you are using, but in general you need to add the Delta Lake jar through the spark.jars.packages config, because it is not among Spark's default jars. For example: .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.0")
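As a rough illustration of that suggestion, the sketch below builds the Maven coordinate for `spark.jars.packages`. It assumes Scala 2.12 and uses the Delta version from the question (1.2.1) as an example; the helper name `delta_core_coordinate` is hypothetical.

```python
def delta_core_coordinate(delta_version, scala_version="2.12"):
    """Build the Maven coordinate of the delta-core artifact for spark.jars.packages."""
    return f"io.delta:delta-core_{scala_version}:{delta_version}"


coordinate = delta_core_coordinate("1.2.1")
print(coordinate)

# Possible usage with the builder from the question:
# builder = builder.config("spark.jars.packages", coordinate)
```

Keeping the version in one variable makes it easier to match the jar to the installed Spark and Delta versions.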
I followed the link here to install the connector. The build is successful, but I cannot find the connector.
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
.config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
.getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.
Change the read call to:
df = my_spark.read.format("mongodb").load()
Then you have to tell PySpark where to find the MongoDB libraries, e.g.:
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
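The same command can be assembled in Python, which helps when the jar list changes often. This is only a sketch: the function name `spark_submit_args` and the `/home/me/java/lib` directory are made up, while the jar names and the spark-submit path come from the command above.

```python
# Jar names from the spark-submit command above.
MONGO_JARS = [
    "mongo-spark-connector-10.0.0.jar",
    "mongodb-driver-sync-4.3.2.jar",
    "mongodb-driver-core-4.3.2.jar",
    "bson-4.3.2.jar",
]


def spark_submit_args(script, lib_dir, jar_names=MONGO_JARS,
                      spark_submit="/usr/local/bin/spark-submit"):
    """Return an argument list with a comma-separated --jars value."""
    base = lib_dir.rstrip("/")
    jars = ",".join(f"{base}/{name}" for name in jar_names)
    return [spark_submit, "--jars", jars, script]


# "/home/me/java/lib" is a placeholder for wherever the jars live.
args = spark_submit_args("mongo_spark1.py", "/home/me/java/lib")
print(" ".join(args))

# To actually launch the job:
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems, at the cost of having to spell out the directory instead of relying on `$HOME` expansion.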
I'm running PySpark in local mode.
MongoDB version: 4
Spark version: 3.2.1
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession
url = 'mongodb://id:port/Database.collection'
spark = (SparkSession
.builder
.master('local[*]')
.config('spark.driver.extraClassPath','path_to_jars/*')
.config("spark.mongodb.read.connection.uri",url)
.config("spark.mongodb.write.connection.uri", url)
.getOrCreate()
)
df = spark.read.format("mongodb").load()
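Since this setup depends on every jar being present in the folder, a quick check before starting Spark can save a confusing ClassNotFoundException later. The sketch below is an assumption-laden helper, not part of the connector: `missing_jars` is a hypothetical name, and the jar list is simply the one given above.

```python
# Jar names listed in the post above; adjust the versions to what you downloaded.
REQUIRED_JARS = [
    "bson-4.7.0.jar",
    "mongodb-driver-legacy-4.7.0.jar",
    "mongo-spark-connector-10.0.3.jar",
    "mongodb-driver-core-4.7.0.jar",
    "mongodb-driver-sync-4.7.0.jar",
]


def missing_jars(present_files, required=REQUIRED_JARS):
    """Return the required jar names that do not appear in present_files."""
    present = set(present_files)
    return [name for name in required if name not in present]


# Example with a folder listing that lacks two of the jars.
example_listing = [
    "bson-4.7.0.jar",
    "mongo-spark-connector-10.0.3.jar",
    "mongodb-driver-core-4.7.0.jar",
]
print(missing_jars(example_listing))

# Possible usage against the real folder:
# import os
# print(missing_jars(os.listdir("path_to_jars")))
```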
How do I connect to MongoDB Atlas from a Databricks cluster using PySpark?
This is my simple code in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>@mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
.getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, under Advanced Options on the Spark tab, add the following two config parameters:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note that connection-string should look like this (with your own user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook; it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority")\
.option("database", "my_database")\
.option("collection", "my_collection")\
.load()
You can use the following to write to MongoDB:
events_df.write\
.format('com.mongodb.spark.sql.DefaultSource')\
.mode("append")\
.option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
.save()
.save()
I hope this works for you. Please let others know if it did.
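The connection-string format from the steps above can also be built in code, which helps when the password contains characters such as `@` or `/` that must be percent-encoded in a MongoDB URI. This is a sketch only: the function name `atlas_uri` and the example password are made up, and the host and option string are the placeholders used in this answer.

```python
from urllib.parse import quote_plus


def atlas_uri(user, password, host, database,
              options="retryWrites=true&w=majority"):
    """Build a mongodb+srv URI, percent-encoding the user name and password."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?{options}"
    )


# "p@ss/word" is a made-up password that needs escaping.
uri = atlas_uri("user", "p@ss/word", "cluster1.s5tuva0.mongodb.net", "my_database")
print(uri)
```

The resulting string can be used as the value for `spark.mongodb.input.uri` and `spark.mongodb.output.uri`, or passed to `.option("uri", uri)` as in the read and write examples.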
I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark = SparkSession \
.builder \
.appName("SparkSQL") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
.getOrCreate()
df = spark.read.format("mongo").load()
When I run it in PySpark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to install the missing Bson reference?
The following error is returned when I connect to MongoDB with PySpark in PyCharm:
"java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html"
Python: 3.6.0
Spark: 2.2.0
Mongo-Spark connector: mongo-spark-connector_2.11-2.2.0.jar
The code is as follows:
spark = SparkSession.builder.appName("Python Spark SQL basic example") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/local.users") \
.getOrCreate()
spark.conf.set("spark.jars", "/ExternalJar/mongo-spark-connector_2.11-2.2.0.jar")
df_users = spark.read.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri", "mongodb://127.0.0.1/local.users")\
.load()