Spark data writing in Delta format - PySpark

Spark Version: 3.2.1
Delta version: 1.2.1 (tried 2.0 version as well)
I am trying to run the getting-started code to try out Delta:
from pyspark.sql import SparkSession
from delta import *
builder = SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
I am getting the error below:
"name": "Py4JJavaError",
"message": "An error occurred while calling o201.showString.\n: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog'
Can anyone please help me understand the issue so I can resolve it?
Thanks in advance.

Not sure which environment and mode you are using, but in general you need to add the jar via the spark.jars.packages config, because the Delta Lake jar is not part of Spark's default jars. For example: .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.0")
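A minimal sketch of the question's builder with the package added via spark.jars.packages (the coordinate is pinned to 1.2.1 to match the Delta version in the question; adjust it to your environment):
from pyspark.sql import SparkSession
# Sketch only: resolve the Delta Lake jar from Maven at session start.
# io.delta:delta-core_2.12:1.2.1 is assumed to match Spark 3.2.x / Delta 1.2.1.
builder = (SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.2.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = builder.getOrCreate()
spark.range(0, 5).write.format("delta").save("/tmp/delta-table")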

Related

NoSuchMethodError in a Google Dataproc cluster for Excel files

While consuming an Excel file in a Dataproc cluster, I am getting the error java.lang.NoSuchMethodError.
Note: the schema gets printed, but not the actual data.
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o74.showString.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
  at com.crealytics.spark.excel.ExcelRelation.buildScan(ExcelRelation.scala:74)
Code:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from google.cloud import storage
from google.cloud import bigquery
import pyspark
client = storage.Client()
bucket_name = "test_bucket"
path=f"gs://{bucket_name}/test_file.xlsx"
def make_spark_session(app_name, jars=[]):
    configuration = (SparkConf()
                     .set("spark.jars", ','.join(jars)))
    spark = SparkSession.builder.appName(app_name) \
        .config(conf=configuration).getOrCreate()
    return spark
app_name = 'test_app'
jars = ['gs://bucket/spark-excel_2.11_uber-0.12.0.jar']
spark = make_spark_session(app_name,jars)
df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .load(path)
df.show(1)
This appears to be a Scala version mismatch between your job jars and the cluster. Both Dataproc 1.5 and 2.0 come with Scala 2.12. The gs://bucket/spark-excel_2.11_uber-0.12.0.jar in your code appears to be built for Scala 2.11; you might want to use spark-excel_2.12_... instead. In addition, make sure your Spark application is also built with Scala 2.12.
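As a sketch, the session builder could pull a Scala 2.12 build of spark-excel from Maven instead of the 2.11 uber jar (the exact artifact version below is an assumption; pick one published for your Spark release):
from pyspark.sql import SparkSession
# Assumed coordinate: a Scala 2.12 build of spark-excel; verify the version
# against your Dataproc image before relying on it.
spark = (SparkSession.builder
    .appName("test_app")
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.7")
    .getOrCreate())
df = (spark.read.format("com.crealytics.spark.excel")
    .option("useHeader", "true")
    .load("gs://test_bucket/test_file.xlsx"))
df.show(1)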

MongoDB PySpark connector setup

I have followed the link here to install; the build is successful, but I cannot find the connector.
from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.
Change to
df = spark.read.format("mongodb").load()
Then you have to tell PySpark where to find the MongoDB libraries, e.g.:
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
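Alternatively, for local development you can let Spark resolve the connector from Maven instead of listing local jars; a sketch assuming the 10.x connector built for Scala 2.12 (adjust the version to your setup):
from pyspark.sql import SparkSession
# Sketch: resolve the 10.x connector from Maven rather than shipping local jars.
# org.mongodb.spark:mongo-spark-connector_2.12:10.0.3 is an assumed version.
spark = (SparkSession.builder
    .appName("myApp")
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.0.3")
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2")
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2")
    .getOrCreate())
df = spark.read.format("mongodb").load()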
I'm running PySpark in local mode.
MongoDB version: 4
Spark version: 3.2.1
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession
url = 'mongodb://id:port/Database.collection'
spark = (SparkSession
    .builder
    .master('local[*]')
    .config('spark.driver.extraClassPath', 'path_to_jars/*')
    .config("spark.mongodb.read.connection.uri", url)
    .config("spark.mongodb.write.connection.uri", url)
    .getOrCreate()
)
df = spark.read.format("mongodb").load()

How to connect Snowflake with PySpark?

I am trying to connect to Snowflake with PySpark on my local machine.
My code is as follows:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark import SparkConf
conf = SparkConf()
conf.set('spark.jars','/path/to/driver/snowflake-jdbc-3.12.17.jar , \
/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')
spark = SparkSession.builder \
    .master("local") \
    .appName("snowflake-test") \
    .config(conf=conf) \
    .getOrCreate()
sfOptions = {
    "sfURL": "https://someurl.com",
    "sfAccount": "account",
    "sfUser": "user",
    "sfPassword": "password",
    "sfDatabase": "database",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "warehouse"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select * from DimDate") \
    .load()
df.show()
When I run this I get the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
How to fix this?
With the Snowflake Spark jar version "spark-snowflake_2.12:2.10.0-spark_3.2", Snowflake JDBC 3.13.14 needs to be used. I see that you are using JDBC version 3.12.17.
Can you add JDBC version 3.13.14 and then test? As pointed out by FKyani, this is a compatibility issue between the Snowflake Spark jar and the JDBC jar.
Please confirm the correct JDBC version is imported.
Recommended client versions: https://docs.snowflake.com/en/release-notes/requirements.html#recommended-client-versions
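A sketch of the question's configuration with only the JDBC jar swapped out (the local paths are placeholders; point them at wherever the drivers were downloaded):
from pyspark import SparkConf
# Sketch: same two jars as the question, with the JDBC driver bumped to 3.13.14
# to match spark-snowflake_2.12-2.10.0-spark_3.2 (paths are assumptions).
conf = SparkConf()
conf.set('spark.jars',
         '/path/to/driver/snowflake-jdbc-3.13.14.jar,'
         '/path/to/connector/spark-snowflake_2.12-2.10.0-spark_3.2.jar')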
This looks similar to the error mentioned in this article:
https://community.snowflake.com/s/article/Error-py4j-protocol-Py4JJavaError-An-error-occurred-while-calling-o74-load-java-lang-NoSUchMethodError-scala-Product-init-Lscala-Product-V
If you are using Scala 2.12, you need to downgrade it to 2.11. Please note that, in this case, you will also have to use the associated version of the Spark connector for Snowflake.
Spark connectors for Snowflake can be found here. We recommend using the latest connector version that matches your Spark version and Scala 2.11.

How to connect to MongoDB Atlas from a Databricks cluster using PySpark

I am trying to connect to MongoDB Atlas from a Databricks cluster using PySpark.
This is my simple code in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
    .getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, under Advanced Options on the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note: the connection string should look like this (with your appropriate user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook and it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .load()
You can use the following to write to MongoDB:
events_df.write \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .mode("append") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
    .save()
Hope this works for you. Please let others know if it did.

Spark cannot find the SQL Server JDBC driver even though both the mssql .jar and the .dll are present in the path

I am trying to connect Spark to a SQL Server instance using this:
#Myscript.py
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "/home/mssql-jdbc-9.2.1.jre15.jar:/home/sqljdbc_auth.dll") \
    .getOrCreate()
sqldb = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://server:5150;databaseName=testdb;integratedSecurity=true") \
    .option("dbtable", "test_tbl") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
sqldb.select('coldate').show()
I have made sure that both the .dll and the .jar are under the /home folder. I call it like so:
spark-submit --jars /home/sqljdbc41.jar MyScript.py
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:3462d79d-c165-4607-9790-67a2c786a9cf
It seems like it cannot find the .dll file? I have verified it exists under /home.
This error was resolved when I placed the sqljdbc_auth.dll file in the "C:\Windows\System32" folder.
For those who want to know where to find this dll file, you may:
Download the JDBC Driver for SQL Server (sqljdbc_6.0.8112.200_enu.exe) from the Microsoft website below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Unzip it and navigate as follows:
\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\auth\x64
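As an alternative to copying the DLL into System32, you can point the driver JVM at the folder that contains sqljdbc_auth.dll via java.library.path; a sketch (the Windows paths below are assumptions, not the asker's actual layout):
from pyspark.sql import SparkSession
# Sketch: keep the JDBC jar on the driver classpath and add the folder holding
# sqljdbc_auth.dll to java.library.path (paths are placeholders).
spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "C:\\drivers\\mssql-jdbc-9.2.1.jre15.jar") \
    .config("spark.driver.extraJavaOptions", "-Djava.library.path=C:\\drivers\\auth\\x64") \
    .getOrCreate()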