Sink Kafka Stream to MongoDB using PySpark Structured Streaming

My Spark session:
spark = SparkSession \
    .builder \
    .appName("Demo") \
    .master("local[3]") \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .getOrCreate()
Mongo URI:
input_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
output_uri_weld = 'mongodb://127.0.0.1:27017/db.coll1'
Function for writing stream batches to Mongo:
def save_to_mongodb_collection(current_df, epoch_id, mongodb_collection_name):
    current_df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("spark.mongodb.output.uri", output_uri_weld) \
        .save()
Kafka Stream:
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()
Write to Mongo:
mongo_writer = df_parsed.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", output_uri_weld) \
    .save()
And my spark.conf file:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0
Error:
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource. Please find packages at http://spark.apache.org/third-party-projects.html

I found a solution.
Since I couldn't find the right Mongo driver for Structured Streaming, I worked out another approach. I now connect to MongoDB directly and use foreach(...) instead of foreachBatch(...). My code in the testSpark.py file looks like this:
....
from pymongo import MongoClient

local_url = "mongodb://localhost:27017"

def write_machine_df_mongo(target_df):
    cluster = MongoClient(local_url)
    db = cluster["test_db"]
    collection = db.test1
    post = {
        "machine_id": target_df.machine_id,
        "proc_type": target_df.proc_type,
        "sensor1_id": target_df.sensor1_id,
        "sensor2_id": target_df.sensor2_id,
        "time": target_df.time,
        "sensor1_val": target_df.sensor1_val,
        "sensor2_val": target_df.sensor2_val,
    }
    collection.insert_one(post)

machine_df.writeStream \
    .outputMode("append") \
    .foreach(write_machine_df_mongo) \
    .start()

Related

Update/Replace value in Mongo Database using Mongo Spark Connector (Pyspark) v10x

I am using the following versions. Details:
mongo-spark-connector:10.0.5
Spark version 3.1.3
I configure the mongo-spark-connector as follows:
spark = SparkSession.builder \
    .appName("hello") \
    .master("yarn") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.cores", "4") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.5") \
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar") \
    .enableHiveSupport() \
    .getOrCreate()
I want to ask how to update and replace values in a MongoDB database.
I read the question Updating mongoData with MongoSpark, but that approach only succeeds with mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have these following attributes:
from bson.objectid import ObjectId

data = {
    "_id": ObjectId("637367d5262dc89a8e318d09"),
    "database": database_name,
    "table": table,
    "latestSyncAt": lastestSyncAt,
    "lastest_id": str(lastest_id)
}
df = spark.createDataFrame([data])
How do I update or replace the _id attribute value in MongoDB using the mongo-spark-connector?
Thank you very much for your support.

java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing

I'm trying to read/write streaming data from the Cosmos DB API for MongoDB in Databricks PySpark and getting the error java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
Can anyone help with how we can achieve data streaming in PySpark?
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType, BooleanType, DateType, StructType, LongType, IntegerType

spark = SparkSession \
    .builder \
    .appName("streamingExampleRead") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.0") \
    .getOrCreate()
sourceConnectionString = <primary connection string of the Cosmos DB API for MongoDB instance>
sourceDb = <your database name>
sourceCollection = <your collection name>
dataStreamRead = (
    spark.readStream.format("mongodb")
    .option("spark.mongodb.connection.uri", sourceConnectionString)
    .option("spark.mongodb.database", sourceDb)
    .option("spark.mongodb.collection", sourceCollection)
    .option("spark.mongodb.change.stream.publish.full.document.only", "true")
    .option("forceDeleteTempCheckpointLocation", "true")
    .load()
)
display(dataStreamRead)
query2 = (dataStreamRead.writeStream
    .outputMode("append")
    .option("forceDeleteTempCheckpointLocation", "true")
    .format("console")
    .trigger(processingTime="1 seconds")
    .start()
    .awaitTermination())
Getting following error:
java.lang.UnsupportedOperationException: Data source mongodb does not support microbatch processing.
at org.apache.spark.sql.errors.QueryExecutionErrors$.microBatchUnsupportedByDataSourceError(QueryExecutionErrors.scala:1579)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:123)
Data source mongodb does not support microbatch processing.
=== Streaming Query ===
Identifier: [id = 78cfcef1-19de-40f4-86fc-847109263ee9, runId = d2212e1f-5247-4cd2-9c8c-3cc937e2c7c5]
Current Committed Offsets: {}
Current Available Offsets: {}
Current State: INITIALIZING
Thread State: RUNNABLE
Try using trigger(continuous="1 second") instead of trigger(processingTime='1 seconds').

pyspark spark-submit unable to read from Mongo Atlas serverless (can read from free version)

I've been using Apache Spark (PySpark) to read from MongoDB Atlas. I have a shared (free) cluster, which has a limit of 512 MB storage.
I'm trying to migrate to serverless, but am somehow unable to connect to the serverless instance. Error:
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:<password>#versa-serverless.w9yss.mongodb.net/versa?retryWrites=true&w=majority'
Please note:
I'm able to connect to the instance using pymongo, but not using pyspark.
Here is the pyspark code (not working):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoDB operations").getOrCreate()
print(" spark ", spark)

# cluster0 is the free tier, and I'm able to connect to it:
# mongoConnUri = "mongodb+srv://vani:password#cluster0.w9yss.mongodb.net/?retryWrites=true&w=majority"
mongoConnUri = "mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority"
mongoDB = "versa"
collection = "name_map_unique_ip"

df = spark.read \
    .format("mongo") \
    .option("uri", mongoConnUri) \
    .option("database", mongoDB) \
    .option("collection", collection) \
    .load()
Error :
22/07/26 12:25:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/07/26 12:25:36 INFO SharedState: Warehouse path is 'file:/Users/karanalang/PycharmProjects/Versa-composer-mongo/composer_dags/spark-warehouse'.
spark <pyspark.sql.session.SparkSession object at 0x7fa1d8b9d5e0>
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/Kafka/python_mongo/StructuredStream_readFromMongoServerless.py", line 30, in <module>
df = spark.read\
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 164, in load
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority'
22/07/26 12:25:36 INFO SparkContext: Invoking stop() from shutdown hook
22/07/26 12:25:36 INFO SparkUI: Stopped Spark web UI at http://10.42.28.205:4040
pymongo code (I am able to connect using the same URI):
from pymongo import MongoClient, ReturnDocument
# from multitenant_management import models
client = MongoClient("mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/vani?retryWrites=true&w=majority")
print(client)
all_dbs = client.list_database_names()
print(f"all_dbs : {all_dbs}")
Any ideas on how to debug/fix this?
Thanks in advance!

Spark Structured Stream qubole Kinesis connector errors out with "Got an exception while fetching credentials"

I am using the following code to write to Kinesis from a Spark Structured Streaming job. It errors out with the error below. The AWS credentials have admin access, and I am able to use the AWS console with them. What could be the issue here?
22/03/16 13:46:34 ERROR AWSInstanceProfileCredentialsProviderWithRetries: Got an exception while fetching credentials org.apache.spark.sql.kinesis.shaded.amazonaws.SdkClientException: Unable to load credentials from service endpoint
val finalDF = rawDF.select(expr("CAST(rand() AS STRING) as partitionKey"),
  to_json(struct("*")).alias("data"))
finalDF.printSchema()

val query = finalDF.writeStream
  .outputMode("update")
  .format("kinesis")
  .option("streamName", "sparkstream2")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("region", "us-east-1")
  .option("awsAccessKey", "") // Creds removed
  .option("awsSecretKey", "") // Creds removed
  .option("checkpointLocation", "chk-point-dir")
  .start()
query.awaitTermination()
spark.stop()
The printSchema output looks as follows:
root
|-- partitionKey: string (nullable = false)
|-- data: string (nullable = true)
I am using the connector from Qubole:
https://github.com/qubole/kinesis-sql
I had this issue too. Add .option("awsUseInstanceProfile", "false"): the kinesis-sql package doesn't handle AWS credentials by default the way one would expect. A GitHub issue on the project provided the lead.

How to convert a SnowflakeCursor in to a pySpark dataframe

The result of sending SQL to Snowflake is a SnowflakeCursor. How would I easily convert it into a pySpark dataframe?
Thanks!
When using Databricks (https://docs.databricks.com/data/data-sources/snowflake.html), we can use spark.read to load the result of a SQL statement into a dataframe. Note that specifying sfRole may be the key to gaining access to your database objects.
options = {
    "sfUrl": "https://yourinstance.snowflakecomputing.com/",
    "sfUser": user,
    "sfPassword": pw,
    "sfDatabase": "db",
    "sfSchema": "schema",
    "sfRole": "Accountadmin",
    "sfWarehouse": "wh"
}
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", strCheckingSQL) \
    .load()