pyspark spark-submit unable to read from Mongo Atlas serverless(can read from free version) - mongodb

I've been using Apache Spark (pyspark) to read from MongoDB Atlas. I have a shared (free) cluster, which has a 512 MB storage limit.
I'm trying to migrate to serverless, but I'm unable to connect to the serverless instance. The error is:
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:<password>#versa-serverless.w9yss.mongodb.net/versa?retryWrites=true&w=majority'
Please note:
I'm able to connect to the instance using pymongo, but not using pyspark.
Here is the pyspark code (Not Working):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoDB operations").getOrCreate()
print(" spark ", spark)

# cluster0 is the free version, and I'm able to connect to it
# mongoConnUri = "mongodb+srv://vani:password#cluster0.w9yss.mongodb.net/?retryWrites=true&w=majority"
mongoConnUri = "mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority"

mongoDB = "versa"
collection = "name_map_unique_ip"

df = spark.read \
    .format("mongo") \
    .option("uri", mongoConnUri) \
    .option("database", mongoDB) \
    .option("collection", collection) \
    .load()
Error:
22/07/26 12:25:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/07/26 12:25:36 INFO SharedState: Warehouse path is 'file:/Users/karanalang/PycharmProjects/Versa-composer-mongo/composer_dags/spark-warehouse'.
spark <pyspark.sql.session.SparkSession object at 0x7fa1d8b9d5e0>
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/Kafka/python_mongo/StructuredStream_readFromMongoServerless.py", line 30, in <module>
df = spark.read\
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 164, in load
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority'
22/07/26 12:25:36 INFO SparkContext: Invoking stop() from shutdown hook
22/07/26 12:25:36 INFO SparkUI: Stopped Spark web UI at http://10.42.28.205:4040
pymongo code (I am able to connect using the same URI):
from pymongo import MongoClient, ReturnDocument
# from multitenant_management import models
client = MongoClient("mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/vani?retryWrites=true&w=majority")
print(client)
all_dbs = client.list_database_names()
print(f"all_dbs : {all_dbs}")
Any ideas on how to debug/fix this?
Thanks in advance!

Related

java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing

I'm trying to read/write streaming data from the Cosmos DB API for MongoDB into Databricks pyspark and I'm getting the error java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
Can anyone please help with how we can achieve data streaming in pyspark?
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.streaming import *
from pyspark.sql.types import StringType, BooleanType, DateType, StructType, LongType, IntegerType

spark = SparkSession. \
    builder. \
    appName("streamingExampleRead"). \
    config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.0'). \
    getOrCreate()

sourceConnectionString = <primary connection string of Cosmos DB API for MongoDB instance>
sourceDb = <your database name>
sourceCollection = <your collection name>

dataStreamRead = (
    spark.readStream.format("mongodb")
        .option('spark.mongodb.connection.uri', sourceConnectionString)
        .option('spark.mongodb.database', sourceDb)
        .option('spark.mongodb.collection', sourceCollection)
        .option('spark.mongodb.change.stream.publish.full.document.only', 'true')
        .option("forceDeleteTempCheckpointLocation", "true")
        .load()
)

display(dataStreamRead)

query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(processingTime='1 seconds')
    .start()
    .awaitTermination())
Getting the following error:
java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
at org.apache.spark.sql.errors.QueryExecutionErrors$.microBatchUnsupportedByDataSourceError(QueryExecutionErrors.scala:1579)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:123)
Data source mongodb does not support microbatch processing.
=== Streaming Query ===
Identifier: [id = 78cfcef1-19de-40f4-86fc-847109263ee9, runId = d2212e1f-5247-4cd2-9c8c-3cc937e2c7c5]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
Try using trigger(continuous="1 second") instead of trigger(processingTime='1 seconds').
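For reference, a minimal sketch of that change applied to the dataStreamRead DataFrame from the question; everything except the trigger is unchanged:

query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    # continuous trigger instead of a micro-batch (processingTime) trigger
    .trigger(continuous="1 second")
    .start())
query2.awaitTermination()

With trigger(continuous=...) Spark schedules the query on the continuous processing engine rather than the micro-batch engine, which is the engine the exception is complaining about.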

Sink Kafka Stream to MongoDB using PySpark Structured Streaming

My Spark:
spark = SparkSession \
    .builder \
    .appName("Demo") \
    .master("local[3]") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .getOrCreate()
Mongo URI:
input_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
output_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoc_id, mongodb_collection_name):
    current_df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("spark.mongodb.output.uri", output_uri_weld) \
        .save()
Kafka Stream:
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()
Write to Mongo:
mongo_writer = df_parsed.write \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .mode('append') \
    .option("spark.mongodb.output.uri", output_uri_weld) \
    .save()
And my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html
I found a solution.
Since I couldn't find the right MongoDB driver for Structured Streaming, I worked on another solution.
Now I use a direct connection to MongoDB and use foreach(...) instead of foreachBatch(...). My code in the testSpark.py file looks like this:
# ...
import pymongo
from pymongo import MongoClient

local_url = "mongodb://localhost:27017"

def write_machine_df_mongo(target_df):
    cluster = MongoClient(local_url)
    db = cluster["test_db"]
    collection = db.test1

    post = {
        "machine_id": target_df.machine_id,
        "proc_type": target_df.proc_type,
        "sensor1_id": target_df.sensor1_id,
        "sensor2_id": target_df.sensor2_id,
        "time": target_df.time,
        "sensor1_val": target_df.sensor1_val,
        "sensor2_val": target_df.sensor2_val,
    }

    collection.insert_one(post)

machine_df.writeStream \
    .outputMode("append") \
    .foreach(write_machine_df_mongo) \
    .start()
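One design note on the snippet above: foreach calls write_machine_df_mongo once per row, so a new MongoClient is opened for every record. If that ever becomes a bottleneck, a possible alternative is foreachBatch, which hands the function one regular DataFrame per micro-batch. A minimal sketch, assuming the same local MongoDB and the same machine_df, and that each batch is small enough to collect on the driver:

from pymongo import MongoClient

def write_batch_mongo(batch_df, batch_id):
    # one client per micro-batch instead of one per row
    cluster = MongoClient("mongodb://localhost:27017")
    collection = cluster["test_db"].test1
    # collect() pulls the batch to the driver, so this is only suitable for small batches
    docs = [row.asDict() for row in batch_df.collect()]
    if docs:
        collection.insert_many(docs)

machine_df.writeStream \
    .outputMode("append") \
    .foreachBatch(write_batch_mongo) \
    .start()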

Issue with write from Databricks to Azure cosmos DB

I am trying to write data from Spark (using Databricks) to Azure Cosmos DB (Mongo DB). There are no errors when executing the notebook, but I am getting the below error when querying the collection.
I have used the jar from the Databricks website, azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar. My runtime version is 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._

val configMap = Map(
  "Endpoint" -> "https://******.documents.azure.com:***/",
  "Masterkey" -> "****==",
  "Database" -> "dbname",
  "Collection" -> "collectionname"
)
val config = Config(configMap)

val df = spark.sql("select id,name,address from emp_db.employee")
CosmosDBSpark.save(df, config)
When I query the collection, I get the below response:
Error: error: {
    "_t" : "OKMongoResponse",
    "ok" : 0,
    "code" : 1,
    "errmsg" : "Unknown server error occurred when processing this request.",
    "$err" : "Unknown server error occurred when processing this request."
}
Any help would be much appreciated. Thank you!!!
That error suggests you are using Cosmos DB with the MongoDB API.
The Spark connector for Cosmos DB only supports it when using the SQL API.
Instead, you should use the MongoDB connector.
Rather than https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector,
use this: https://docs.mongodb.com/spark-connector/master/
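For illustration, a minimal PySpark sketch (the question itself is in Scala) of writing the same DataFrame through the MongoDB Spark connector. The connection string is a placeholder, and it assumes a connector package such as org.mongodb.spark:mongo-spark-connector_2.11:2.4.1 is attached to the cluster:

# placeholder: use the Cosmos DB for MongoDB connection string from the Azure portal
output_uri = "<primary connection string of the Cosmos DB API for MongoDB account>"

df = spark.sql("select id,name,address from emp_db.employee")

df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", output_uri) \
    .option("database", "dbname") \
    .option("collection", "collectionname") \
    .save()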

Error connecting from pyspark to mongodb with password

I'm running the following pyspark code with a connection to MongoDB:
sparkConf = SparkConf().setMaster("local").setAppName("MongoSparkConnectorTour") \
    .set("spark.app.id", "MongoSparkConnectorTour")

# If executed via pyspark, sc is already instantiated
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)

# create and load dataframe from MongoDB URI
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", config.MONGO_URL_AUTH + "/spark.times") \
    .load()
inside a Docker image with
CMD [ "spark-submit" \
, "--conf", "spark.mongodb.input.uri=mongodb://root:example#mongodb:27017/spark.times" \
, "--conf", "spark.mongodb.output.uri=mongodb://root:example#mongodb:27017/spark.output" \
, "--packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1" \
, "./spark.py" ]
config.MONGO_URL_AUTH is mongodb://root:example#mongodb:27017
but I'm getting an exception on run:
db_1 | 2019-10-09T13:44:34.354+0000 I ACCESS [conn4] Supported SASL mechanisms requested for unknown user 'root#spark'
db_1 | 2019-10-09T13:44:34.378+0000 I ACCESS [conn4] SASL SCRAM-SHA-1 authentication failed for root on spark from client 172.22.0.4:49302 ; UserNotFound: Could not find user "root" for db "spark"
pyspark_1 | Traceback (most recent call last):
pyspark_1 | File "/home/ubuntu/./spark.py", line 35, in <module>
pyspark_1 | .option("spark.mongodb.input.uri", config.MONGO_URL_AUTH + "/spark.times")\
pyspark_1 | File "/home/ubuntu/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
pyspark_1 | return self._df(self._jreader.load())
pyspark_1 | File "/home/ubuntu/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
pyspark_1 | File "/home/ubuntu/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
pyspark_1 | return f(*a, **kw)
pyspark_1 | File "/home/ubuntu/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
pyspark_1 | py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
pyspark_1 | : com.mongodb.MongoSecurityException: Exception authenticating MongoCredential{mechanism=SCRAM-SHA-1, userName='root', source='spark', password=<hidden>, mechanismProperties={}}
pyspark_1 | at com.mongodb.internal.connection.SaslAuthenticator.wrapException(SaslAuthenticator.java:173)
Everything in this setup works flawlessly if I don't use a user and password in the mongodb Docker container and just connect with the mongodb://mongodb:27017 address. With pymongo I can connect using the password, so something is wrong with my Spark-to-MongoDB configuration when a password is used, and I can't understand what.
Setup for mongodb (part of the docker-compose file):
db:
  image: mongo
  restart: always
  networks:
    miasnet:
      aliases:
        - "miasdb"
  environment:
    MONGO_INITDB_ROOT_USERNAME: root
    MONGO_INITDB_ROOT_PASSWORD: example
    MONGO_INITDB_DATABASE: spark
  ports:
    - "27017:27017"
  volumes:
    - /data/db:/data/db
https://hub.docker.com/_/mongo reads:
MONGO_INITDB_ROOT_USERNAME, MONGO_INITDB_ROOT_PASSWORD
These variables, used in conjunction, create a new user and set that user's password. This user is created in the admin authentication database and given the role of root, which is a "superuser" role.
You don't specify the authentication database, so mongo uses the current database by default, which is spark in your case.
You need to specify "admin" auth database in the connection string:
spark.mongodb.input.uri=mongodb://root:example#mongodb:27017/spark.times?authSource=admin
spark.mongodb.output.uri=mongodb://root:example#mongodb:27017/spark.output?authSource=admin
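A minimal sketch of the same fix applied inside the pyspark code instead of through spark-submit --conf, reusing config.MONGO_URL_AUTH from the question; only the authSource query parameter is new:

# append authSource=admin so authentication runs against the admin database
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("spark.mongodb.input.uri", config.MONGO_URL_AUTH + "/spark.times?authSource=admin") \
    .load()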

Using KuduContext in pyspark

I would like to use kudu with pyspark.
While I can use it with:
sc.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"hdp1:7051").option('kudu.table',"impala::test.z_kudu_tab").load()
I cannot find a way to import KuduContext.
I'm working in a jupyter notebook, and importing it with:
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g --packages com.ibm.spss.hive.serde2.xml:hivexmlserde:1.0.5.3 --packages org.apache.kudu:kudu-spark2_2.11:1.7.0 pyspark-shell"
My non-working code:
kudu_Context = KuduContext("es2-hdp1:7051", sc)
Dies with error:
NameError: name 'KuduContext' is not defined
I've also tried:
kudu_context = sc._jvm.org.apache.kudu.spark.kudu.KuduContext("hdp1:7051", sc.sparkContext)
which dies with error:
AttributeError: 'SparkContext' object has no attribute '_get_object_id'