How to connect to MongoDB Atlas from a Databricks cluster using PySpark
This is my simple code in a notebook:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>@mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
    .getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster; I used the _2.12 artifact because my cluster runs Scala 2.12.
Go to the cluster detail page and, in Advanced Options under the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note: the connection string should look like this (use your own user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
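Put together, the Spark config box on the cluster would contain two lines like these (user, password, host, and database are placeholders from the example above, not real values):

spark.mongodb.input.uri mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
spark.mongodb.output.uri mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority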
Use the following Python code in your notebook; it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .load()
You can use the following to write to MongoDB:
events_df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
    .save()
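The write snippet assumes events_df already exists; a minimal sketch to build one for a quick test (the column names here are made up):

# Hypothetical sample data, only to exercise the write path
events_df = spark.createDataFrame(
    [("login", "2022-01-01"), ("logout", "2022-01-02")],
    ["event_type", "event_date"],
)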
Hope this works for you. Please let others know if it did.
Related
I am running the code below to pull a BigQuery table using PySpark. The Spark session starts without any issue, but I am not able to connect to the table in the public dataset. Here is the error I get from running the script.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
(Error screenshot: https://i.stack.imgur.com/actAv.png)
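One thing worth checking: spark.jars.packages expects Maven coordinates, while a direct jar URL normally goes under spark.jars. A sketch of the same session with that change (the gs:// path is the one from the question, untested here):

from pyspark.sql import SparkSession

# spark.jars accepts jar paths/URLs; spark.jars.packages expects Maven coordinates
spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()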
I have followed the link below to install; the build is successful but I cannot find the connector.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.
Change it to:
df = spark.read.format("mongodb").load()
Then you have to tell PySpark where to find the Mongo libs, e.g.:
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
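If you would rather not manage the jars by hand, --packages should pull the connector and its driver dependencies from Maven instead; a sketch, assuming the coordinates match the 10.0.0 jar above:

/usr/local/bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector:10.0.0 mongo_spark1.py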
I'm running PySpark in local mode, with MongoDB version 4 and Spark version 3.2.1.
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession

url = 'mongodb://id:port/Database.collection'

spark = (SparkSession
         .builder
         .master('local[*]')
         .config('spark.driver.extraClassPath', 'path_to_jars/*')
         .config("spark.mongodb.read.connection.uri", url)
         .config("spark.mongodb.write.connection.uri", url)
         .getOrCreate()
         )

df = spark.read.format("mongodb").load()
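Since the session is created once, you can also point a single read at a different database or collection through reader options instead of baking them into the URL; a sketch (the database and collection names are placeholders):

# Override database/collection per read instead of embedding them in the URI
df2 = (spark.read.format("mongodb")
       .option("database", "Database")
       .option("collection", "another_collection")
       .load())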
I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
When I execute it in PySpark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to install the missing Bson reference?
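One thing to try (not verified against this exact setup): --packages normally resolves the bson classes transitively, but if they are still missing, for example because of a conflicting jar already on the classpath, you can add the BSON library explicitly. The driver version below is a guess meant to pair with connector 3.0.1:

/usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1,org.mongodb:bson:4.0.5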
I am trying to connect Spark to SQL Server using this:
# MyScript.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "/home/mssql-jdbc-9.2.1.jre15.jar:/home/sqljdbc_auth.dll") \
    .getOrCreate()

sqldb = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://server:5150;databaseName=testdb;integratedSecurity=true") \
    .option("dbtable", "test_tbl") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
sqldb.select('coldate').show()
I have made sure that both the .dll and the .jar are under the /home folder. I call it like so:
spark-submit --jars /home/sqljdbc41.jar MyScript.py
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:3462d79d-c165-4607-9790-67a2c786a9cf
It seems like it cannot find the .dll file? I have verified it exists under /home.
This error was resolved when I placed the sqljdbc_auth.dll file in the "C:\Windows\System32" folder.
For those who want to know where to find this dll file, you may:
Download the JDBC Driver for SQL Server (sqljdbc_6.0.8112.200_enu.exe) from the Microsoft website below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Unzip it and navigate as follows:
\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\auth\x64
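If you would rather not copy files into System32: the .dll is a native library, not a jar, so another option is to put its folder on java.library.path instead of the classpath. A sketch assuming the file still sits under /home:

spark-submit --jars /home/mssql-jdbc-9.2.1.jre15.jar \
  --conf "spark.driver.extraJavaOptions=-Djava.library.path=/home" \
  MyScript.py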
Is there any way to get the database collection names natively with the Spark connector?
Right now I'm using PyMongo to do it, but I wonder if it is possible with the Spark connector.
My current method, just for info:
from pymongo import MongoClient
from pyspark.sql import SparkSession

db = MongoClient().database
db_names = db.collection_names(False)

for name in db_names:
    spark = SparkSession \
        .builder \
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database." + name) \
        .config("spark.mongodb.output.uri", "mongodb://localhost:27017/database." + name) \
        .getOrCreate()
    ...
With Python, the Mongo Spark Connector only exposes the Spark API, so there is no native way to list collections.
Also, please note the SparkSession is a singleton, so when changing collections the configuration should be done on the DataFrameReader using the option() method.
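For example, a sketch that reuses one session and switches collections through the reader (connector 2.x/3.x option names; the host and database are placeholders):

from pymongo import MongoClient
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database") \
    .getOrCreate()

# List the collections with pymongo, then read each one through the same session
for name in MongoClient().database.list_collection_names():
    df = spark.read.format("mongo") \
        .option("collection", name) \
        .load()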