
how to connect to mongodb Atlas from databricks cluster using pyspark
This is my simple code in a notebook:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
    .getOrCreate()

df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?

I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, under Advanced Options on the Spark tab, add the following two config parameters:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note: the connection string should look like this (substitute your own user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
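One thing that trips people up with these connection strings: if the password contains URL-special characters (`@`, `/`, `:`, `#`), it must be percent-encoded or the URI will not parse. A minimal sketch using only Python's standard library (the user, password, and host below are made-up placeholders):

```python
from urllib.parse import quote_plus

# Hypothetical credentials and host; substitute your own values.
user = "my_user"
password = "p@ss/word"  # contains URL-special characters
host = "cluster1.s5tuva0.mongodb.net"
database = "my_database"

# Percent-encode the credentials so characters like '@' or '/'
# don't break URI parsing, then assemble the SRV connection string.
uri = (
    f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
    f"@{host}/{database}?retryWrites=true&w=majority"
)
print(uri)
```

The resulting string can then be pasted into the cluster's Spark config or passed to option("uri") as shown below.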
Use the following Python code in your notebook; it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .load()
You can use the following to write to MongoDB:
events_df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
    .save()
Hope this works for you. Please let others know if it did.

Related

Pyspark not able to read from bigquery table

I am running the code below to pull a BigQuery table using PySpark. The Spark session initializes without any issue, but I am not able to connect to the table in the public dataset. Here is the error I get from running the script:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('Optimize BigQuery Storage') \
    .config('spark.jars.packages', 'gs://spark-lib/bigquery/spark-3.1-bigquery-0.27.1-preview.jar') \
    .getOrCreate()

df = spark.read \
    .format("bigquery") \
    .load("bigquery-public-data.samples.shakespeare")
(screenshot of the error: https://i.stack.imgur.com/actAv.png)

mongodb pyspark connector set up

I have followed the link here to install; the build is successful, but I cannot find the connector.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/intca2.tweetsIntca2") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
Py4JJavaError: An error occurred while calling o592.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
The connector was downloaded and built from here:
https://github.com/mongodb/mongo-spark#please-see-the-downloading-instructions-for-information-on-getting-and-using-the-mongodb-spark-connector
I am using Ubuntu 20.04.
Change it to:
df = spark.read.format("mongodb").load()
Then you have to tell pyspark where to find the Mongo libs, e.g.:
/usr/local/bin/spark-submit --jars $HOME/java/lib/mongo-spark-connector-10.0.0.jar,$HOME/java/lib/mongodb-driver-sync-4.3.2.jar,$HOME/java/lib/mongodb-driver-core-4.3.2.jar,$HOME/java/lib/bson-4.3.2.jar mongo_spark1.py
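As an alternative to listing every jar by hand, spark-submit can usually resolve the connector and its transitive dependencies from Maven Central via --packages. A sketch, assuming the 10.0.x line of the connector (whose Maven artifact, matching the jar name above, carries no Scala suffix):

```shell
# Resolves mongo-spark-connector 10.0.0 plus its transitive dependencies
# (mongodb-driver-sync, mongodb-driver-core, bson) from Maven Central,
# so the individual jars don't need to be downloaded manually.
/usr/local/bin/spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector:10.0.0 \
  mongo_spark1.py
```

The downside is that the first run needs network access to fetch the artifacts; the explicit --jars form above works fully offline.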
I'm running pyspark in local mode.
MongoDB version 4
Spark version 3.2.1
I downloaded all the needed jars into one folder (path_to_jars) and added it to the Spark config:
bson-4.7.0.jar
mongodb-driver-legacy-4.7.0.jar
mongo-spark-connector-10.0.3.jar
mongodb-driver-core-4.7.0.jar
mongodb-driver-sync-4.7.0.jar
from pyspark.sql import SparkSession

url = 'mongodb://host:port/Database.collection'

spark = (SparkSession
    .builder
    .master('local[*]')
    .config('spark.driver.extraClassPath', 'path_to_jars/*')
    .config("spark.mongodb.read.connection.uri", url)
    .config("spark.mongodb.write.connection.uri", url)
    .getOrCreate()
)

df = spark.read.format("mongodb").load()

Pyspark, MongoDB and Missing BSON Reference

I am trying to write a basic pyspark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.
My code is:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
When I execute it in pyspark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:
java.lang.NoClassDefFoundError: org/bson/conversions/Bson
I am very new to Spark. Could someone please help me understand how to resolve the missing BSON reference?

Spark cannot find SQL server jdbc driver even if both the mssql.jar and the .dll are present in the path

I am trying to connect Spark to a SQL server using this:
# MyScript.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath", "/home/mssql-jdbc-9.2.1.jre15.jar:/home/sqljdbc_auth.dll") \
    .getOrCreate()

sqldb = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://server:5150;databaseName=testdb;integratedSecurity=true") \
    .option("dbtable", "test_tbl") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

sqldb.select('coldate').show()
I have made sure that both the .dll and the .jar are under the /home folder. I call it like so:
spark-submit --jars /home/sqljdbc41.jar MyScript.py
py4j.protocol.Py4JJavaError: An error occurred while calling o51.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:3462d79d-c165-4607-9790-67a2c786a9cf
It seems like it cannot find the .dll file? I have verified it exists under /home.
This error was resolved when I placed the sqljdbc_auth.dll file in the "C:\Windows\System32" folder.
For those who want to know where to find this dll file, you can:
Download the JDBC Driver for SQL Server (sqljdbc_6.0.8112.200_enu.exe) from the Microsoft website below:
https://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774
Unzip it and navigate as follows:
\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\auth\x64

Get collection names with Spark connector MongoDB

Is there any way to get the database collection names natively with the Spark connector?
Right now I'm using pymongo to do it, but I wonder if it is possible to do it with the Spark connector.
My current method, just for info:
from pymongo import MongoClient

db = MongoClient().database
db_names = db.collection_names(False)

for name in db_names:
    spark = SparkSession \
        .builder \
        .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database." + name) \
        .config("spark.mongodb.output.uri", "mongodb://localhost:27017/database." + name) \
        .getOrCreate()
    ...
With Python, the Mongo Spark Connector only uses the Spark API, so there is no native way to list collections.
Also, please note that the SparkSession is a singleton, so when changing collections the configuration should be set on the DataFrameReader using the option() method.
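Putting those two notes together, here is a sketch of the intended pattern: one SparkSession for the whole job, with pymongo doing the listing and option() switching the collection per read. It assumes a running local MongoDB, the connector on the classpath, and "database" as a placeholder database name.

```python
from pymongo import MongoClient
from pyspark.sql import SparkSession

# Build the session once; SparkSession is a singleton, so rebuilding it
# per collection (as in the loop above) would not re-apply the config.
spark = SparkSession \
    .builder \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/database.placeholder") \
    .getOrCreate()

# pymongo still does the listing; the connector has no native call for it.
for name in MongoClient().database.list_collection_names():
    # Override only the collection on the DataFrameReader.
    df = spark.read.format("mongo") \
        .option("collection", name) \
        .load()
    print(name, df.count())
```

This avoids the per-collection getOrCreate() calls, which would all silently return the same session anyway.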