Issue with write from Databricks to Azure Cosmos DB (MongoDB API)

I am trying to write data from Spark (using Databricks) to Azure Cosmos DB (MongoDB API). There are no errors when executing the notebook, but I get the below error when querying the collection.
I have used the jar from the Databricks website, azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar. My Databricks Runtime version is 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._
val configMap = Map(
"Endpoint" -> "https://******.documents.azure.com:***/",
"Masterkey" -> "****==",
"Database" -> "dbname",
"Collection" -> "collectionname"
)
val config = Config(configMap)
val df = spark.sql("select id,name,address from emp_db.employee")
CosmosDBSpark.save(df, config)
When I query the collection, I get the below response:
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 1,
"errmsg" : "Unknown server error occurred when processing this request.",
"$err" : "Unknown server error occurred when processing this request."
}
Any help would be much appreciated. Thank you!!!

That error suggests you are using Cosmos DB with the MongoDB API.
The Spark connector for Cosmos DB only supports Cosmos DB when using the SQL API:
https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
Instead, you should use the MongoDB Spark connector:
https://docs.mongodb.com/spark-connector/master/
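A rough sketch of what that could look like from the same notebook (the connector coordinates, the Cosmos DB MongoDB-API connection string, and the save mode below are illustrative assumptions, not a verified setup):

// Sketch: write the same DataFrame via the MongoDB Spark connector
// (e.g. org.mongodb.spark:mongo-spark-connector_2.11:2.4.x attached to the cluster).
// The URI is a placeholder for the account's MongoDB-API connection string.
val mongoUri = "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb"

val empDf = spark.sql("select id, name, address from emp_db.employee")

empDf.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .option("uri", mongoUri)
  .option("database", "dbname")
  .option("collection", "collectionname")
  .save()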

Related

pyspark spark-submit unable to read from Mongo Atlas serverless (can read from free version)

I've been using Apache Spark (PySpark) to read from MongoDB Atlas. I have a shared (free) cluster, which has a limit of 512 MB of storage.
I'm trying to migrate to serverless, but am somehow unable to connect to the serverless instance. The error is:
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:<password>#versa-serverless.w9yss.mongodb.net/versa?retryWrites=true&w=majority'
Please note:
I'm able to connect to the instance using pymongo, but not using pyspark.
Here is the pyspark code (not working):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MongoDB operations").getOrCreate()
print(" spark ", spark)
# cluster0 - is the free version, and i'm able to connect to this
# mongoConnUri = "mongodb+srv://vani:password#cluster0.w9yss.mongodb.net/?retryWrites=true&w=majority"
mongoConnUri = "mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority"
mongoDB = "versa"
collection = "name_map_unique_ip"
df = spark.read\
.format("mongo") \
.option("uri", mongoConnUri) \
.option("database", mongoDB) \
.option("collection", collection) \
.load()
Error:
22/07/26 12:25:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/07/26 12:25:36 INFO SharedState: Warehouse path is 'file:/Users/karanalang/PycharmProjects/Versa-composer-mongo/composer_dags/spark-warehouse'.
spark <pyspark.sql.session.SparkSession object at 0x7fa1d8b9d5e0>
Traceback (most recent call last):
File "/Users/karanalang/PycharmProjects/Kafka/python_mongo/StructuredStream_readFromMongoServerless.py", line 30, in <module>
df = spark.read\
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 164, in load
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority'
22/07/26 12:25:36 INFO SparkContext: Invoking stop() from shutdown hook
22/07/26 12:25:36 INFO SparkUI: Stopped Spark web UI at http://10.42.28.205:4040
pymongo code (I am able to connect using the same uri):
from pymongo import MongoClient, ReturnDocument
# from multitenant_management import models
client = MongoClient("mongodb+srv://vani:password#versa-serverless.w9yss.mongodb.net/vani?retryWrites=true&w=majority")
print(client)
all_dbs = client.list_database_names()
print(f"all_dbs : {all_dbs}")
Any ideas how to debug/fix this?
Thanks in advance!
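For reference, a minimal sketch of how such a read is typically wired up with mongo-spark-connector 3.x against Spark 3.2 (the package coordinates, database/collection names, and connection string are illustrative assumptions, not a verified fix for the serverless error; note that the MongoDB connection-string format uses '@', not '#', between the credentials and the host):

from pyspark.sql import SparkSession

# Sketch: typical PySpark read from Atlas with mongo-spark-connector 3.x.
# URI, credentials, and names are placeholders/assumptions.
mongo_uri = "mongodb+srv://vani:<password>@versa-serverless.w9yss.mongodb.net/versa?retryWrites=true&w=majority"

spark = (
    SparkSession.builder
    .appName("MongoDB operations")
    # ship the connector with the job instead of relying on --packages at spark-submit time
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2")
    .config("spark.mongodb.input.uri", mongo_uri)
    .getOrCreate()
)

df = (
    spark.read.format("mongo")
    .option("database", "versa")
    .option("collection", "name_map_unique_ip")
    .load()
)
df.printSchema()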

Spark/MongoDB connector with $or pipeline syntax

I'm trying to query a MongoDB collection using Spark with the mongo-spark-connector. Here is the code:
val document = Document.parse("{$or: [{key1:\"A\", key2:\"500\"}, {key1:\"A\", key2:\"100\"}] }")
val mongoDf = mongoCollectionRdd.withPipeline(Seq(document)).toDF(mongoCollectionSchema)
But I get the following error:
com.mongodb.MongoCommandException: Command failed with error 40324 (Location40324): 'Unrecognized pipeline stage name: '$or'' on server hostname.fr:27017.
I ran the query through the mongo shell, the execution is fine, and I get the expected result. Here are the versions of the different components:
Scala : 2.12.11
Spark : 3.2.0
MongoDB : 4.2.17
Any ideas?
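One thing worth noting: in MongoDB's aggregation framework, $or is a query operator rather than a pipeline stage, so a pipeline passed to withPipeline normally needs the predicate wrapped in a $match stage. A sketch of that variant, reusing the names from the question (untested against this exact setup):

import org.bson.Document

// Sketch: wrap the $or predicate in a $match stage so it is a valid pipeline stage.
val matchStage = Document.parse(
  """{ "$match": { "$or": [ { "key1": "A", "key2": "500" },
    |                       { "key1": "A", "key2": "100" } ] } }""".stripMargin)

val mongoDf = mongoCollectionRdd.withPipeline(Seq(matchStage)).toDF(mongoCollectionSchema)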

AWS GLUE ERROR : An error occurred while calling o75.pyWriteDynamicFrame. Cannot cast STRING into a IntegerType (value: BsonString{value=''})

I have a simple Glue PySpark job which connects to a MongoDB source through a Glue catalog table, extracts data from MongoDB collections, and writes JSON output to S3 using a Glue dynamic frame.
The Mongo database here is deeply nested NoSQL with structs and arrays. Since it is a NoSQL database, the source schema is not fixed; nested columns may vary from document to document.
However, the job fails with the below error.
ERROR: py4j.protocol.Py4JJavaError: An error occurred while calling o75.pyWriteDynamicFrame.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, 10.3.29.22, executor 1): com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
As the job fails due to a datatype mismatch, I have tried all the solutions I could find, such as using resolveChoice(). Since the error is for a property with an 'int' datatype, I tried casting all the properties with 'int' type to 'string'.
I also tried the code with dropnullfields, writing with a Spark dataframe, applymapping, without using the catalog table (from_options directly from the Mongo table), and with and without repartition.
All these attempts are commented in the code for reference.
CODE SNIPPET
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("Started")
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "<catalog_db_name>", table_name = "<catalog_table_name>", additional_options = {"database": "<mongo_database_name>", "collection": "<mongo_db_collection>"}, transformation_ctx = "datasource0")
# Code to read data directly from mongo database
# datasource0 = glueContext.create_dynamic_frame_from_options(connection_type = "mongodb", connection_options = { "uri": "<connection_string>", "database": "<mongo_db_name>", "collection": "<mongo_collection>", "username": "<db_username>", "password": "<db_password>"})
# Code sample for resolveChoice (converted all the 'int' datatypes to 'string')
# resolve_dyf = datasource0.resolveChoice(specs = [("nested.property", "cast:string"),("nested.further[].property", "cast:string")])
# Code sample to dropnullfields
# dyf_dropNullfields = DropNullFields.apply(frame = resolve_dyf, transformation_ctx = "dyf_dropNullfields")
data_sink0 = datasource0.repartition(1)
print("Repartition done")
# Code sample to sink using spark's write method
# data_sink0.write.format("json").option("header","true").save("s3://<s3_folder_path>")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = data_sink0, connection_type = "s3", connection_options = {"path": "s3://<S3_folder_path>"}, format = "json", transformation_ctx = "datasink1")
print("Data Sink complete")
job.commit()
NOTE
I am not exactly sure why it is happening because this issue is intermittent. Sometimes it works perfectly but at times it fails, so it is quite confusing.
Any help will be highly appreciated.
I was facing the same problem. The simple solution is to increase the sample size from 1000 (which is the default for MongoDB) to 100000. Here is a sample config for your reference:
read_config = {
"uri": documentdb_write_uri,
"database": "your_db",
"collection": "your_collection",
"username": "user",
"password": "password",
"partitioner": "MongoSamplePartitioner",
"sampleSize": "100000",
"partitionerOptions.partitionSizeMB": "1000",
"partitionerOptions.partitionKey": "_id"
}
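A sketch of how that read_config could be plugged into the direct-from-Mongo read shown (commented out) in the question; the names and transformation_ctx value are placeholders:

# Sketch: pass the larger sampleSize via connection_options when reading
# straight from MongoDB/DocumentDB instead of through the catalog table.
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="mongodb",
    connection_options=read_config,
    transformation_ctx="datasource0"
)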

Writing to a Mongo replica set from Spark (in Scala)

I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.
I'm facing two problems:
[main problem] I can't connect to Mongo if I define the host according to the documentation (using all instances in the mongo replica set).
[secondary/related problem] If I connect to the primary only, I can write... but I typically crash the primary writing the first collection.
Environment:
mongo-spark-connector 1.1
spark 1.6
scala 2.10.5
First I'll set up a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work... in this case "url" would look something like machine.unix.domain.org and "ip" would look like... well, an IP address.
This is how the documentation says to define the host... with every machine in the replica set.
val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"
I can't get either of these to work. Using every permutation I can think of for the uri...
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}#${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}#${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
EDIT
More detail on the error messages... the errors take two forms.
form 1
typically includes java.net.UnknownHostException: machine.unix.domain.org
also, it comes back with server addresses in URL form even when they are defined as IP addresses:
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
form 2
(authentication error... though connecting with the same credentials to the primary only works fine)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
end EDIT
here's what DOES work... on the dummy data only... more on that below...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}#${host}/${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri" -> uri,
"database" -> db,
"collection" -> collection_name,
"replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
This... connecting to the primary only... works great for dummy data, but crashes the primary for real data (50 - 100 GB from an RDD with 2700 partitions). My guess is that it opens up too many connections at once... it looks like it opens ~900 connections to write (this fits, since default parallelism is 2700, based on 900 virtual cores and a parallelism factor of 3x).
I'm guessing if I repartition so it opens fewer connections, I'll have better luck... but I'm guessing this also ties in to writing to the primary only instead of spreading it over all instances.
I've read everything I can find here... but most examples are for single instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration
It turns out there were two problems here. From the original question, these were referenced as errors of 'form 1' and 'form 2'.
error of 'form 1' - solution
The gist of the problem turned out to be a bug in the mongo-spark-connector. It turns out that it can't connect to a replica set using IP addresses... it requires URIs. Since the DNS servers in our cloud don't have these lookups, I got it working by modifying /etc/hosts on every executor and then using the connection string format like this:
val host = "URI1:27017,URI2:27017,URI3:27017"
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri"->uri,
"database"->db,
"collection"->collection,
"replicaSet"->replset,
"writeConcern.w"->"majority"))
this required first adding the following to /etc/hosts on every machine:
IP1 URI1
IP2 URI2
IP3 URI3
Now of course, I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts when the cluster spins up. But that's another question. (AWS EMR bootstrap action as sudo)
error of 'form 2' - solution
adding &authSource=${db} to the uri solved this.
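For completeness, a sketch tying the working config above back to the save call from the question; the coalesce is only the question's untested idea for opening fewer simultaneous connections, and the partition count is an arbitrary placeholder:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// Sketch: the corrected URI (hostnames in /etc/hosts plus authSource) with the
// question's save call. Note the standard separator between credentials and host is '@'.
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig = WriteConfig(Map(
  "uri"            -> uri,
  "database"       -> db,
  "collection"     -> collection_name,
  "replicaSet"     -> replset,
  "writeConcern.w" -> "majority"))

// Coalesce to fewer partitions so fewer executors write at once (count is a guess).
MongoSpark.save(rdd_hex_bson.coalesce(100), writeConfig)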

MongoDB Performance Testing Using JMeter (Connection Issue)

We are trying to do performance testing using JMeter. The database is MongoDB.
We are using a JSR223 Sampler with Groovy 2.4.10.
import com.mongodb.DB;
import org.apache.jmeter.protocol.mongodb.config.MongoDBHolder;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCollection;
import com.mongodb.WriteConcern;
import com.mongodb.WriteResult;
DB db = MongoDBHolder.getDBFromSource("admin", "databasename", "username", "password");
DBCollection collection = db.getCollection("test");
long count = collection.getCount();
String result = String.valueOf(count);
SampleResult.setResponseData(result.getBytes())
We are getting the below error:
Response code: 500
Response message: javax.script.ScriptException: com.mongodb.CommandFailureException: { "serverUsed" : "l4abcddb1232/11.20.132.301:27017" , "ok" : 0.0 , "errmsg" : "not authorized on databasename to execute command { count: \"test\", query: {} }" , "code" : 13 , "codeName" : "Unauthorized"}
The above issue is in the Dev database.
Also, how do we connect over SSL to the MongoDB database (QC)?
Thank you in advance!
Bharathi
This is more of a MongoDB question, but see the MongoDB documentation on authentication and connection strings to get started with an authorized user and an SSL connection.
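A rough sketch of the direction that answer points in, staying with the JSR223/Groovy style of the question (the driver API, host, and credentials are assumptions; error code 13 above means the authenticated user lacks permission to run count on that database, so the user also needs an appropriate read role):

import com.mongodb.MongoClient
import com.mongodb.MongoClientURI

// Sketch: authenticate against the admin database and enable SSL via the
// connection string (assumes the 3.x Java driver is on the JMeter classpath;
// host, user, and database names are placeholders).
def uri = new MongoClientURI(
    "mongodb://username:password@dbhost:27017/?authSource=admin&ssl=true")
def client = new MongoClient(uri)
try {
    def db = client.getDB("databasename")          // legacy DB API, as in the sampler above
    long count = db.getCollection("test").getCount()
    SampleResult.setResponseData(String.valueOf(count).getBytes())
} finally {
    client.close()
}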