Is there any way to upsert a MongoDB collection with the spark-mongo connector based on a certain field in the dataframe?
To replace documents based on a unique key constraint, use the replaceDocument and shardKey options. The default shardKey is {_id: 1}.
https://docs.mongodb.com/spark-connector/master/configuration/
df.write.format('com.mongodb.spark.sql') \
    .option('collection', 'target_collection') \
    .option('replaceDocument', 'true') \
    .option('shardKey', '{"date": 1, "name": 1, "resource": 1}') \
    .mode('append') \
    .save()
Setting replaceDocument=false makes your documents get merged (fields updated in place) based on the shardKey, instead of being replaced entirely.
https://github.com/mongodb/mongo-spark/blob/c9e1bc58cb509021d7b7d03367776b84da6db609/src/main/scala/com/mongodb/spark/MongoSpark.scala#L120-L141
As of the MongoDB Connector for Spark version 1.1+ (currently version 2.2), when you execute save() as below:
df.write.mongo()
df.write.format("com.mongodb.spark.sql").save()
If a dataframe contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and new documents whose _id value does not yet exist in the collection will be inserted.
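For instance, a minimal sketch of such a write (the DataFrame and collection name here are assumptions, not from the original answer): each row of updatesDf carries an _id, so existing documents with matching _id values are updated and the rest are inserted.

// Sketch: `updatesDf` is assumed to contain an `_id` column.
// The collection name is a placeholder.
updatesDf.write
  .format("com.mongodb.spark.sql")
  .option("collection", "target_collection")
  .mode("append")
  .save()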
See also MongoDB Spark SQL for more information and snippets.
Try the replaceDocument option:
df.select("_id").withColumn("aaa", lit("ha"))
  .write
  .option("collection", collectionName)
  .option("replaceDocument", "false")
  .mode(SaveMode.Append)
  .format("mongo")
  .save()
I don't know why I can't find any documentation for this option in the MongoDB docs.
With some digging into mongo-spark's source, here's a simple hack to add the feature of upserting on certain fields to the MongoSpark.save method:
// add an additional `keys` parameter
def save[D](dataset: Dataset[D], writeConfig: WriteConfig, keys: List[String]): Unit = {
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  val dataSet = dataset.toDF()
  val mapper = rowToDocumentMapper(dataSet.schema)
  val documentRdd: RDD[BsonDocument] = dataSet.rdd.map(row => mapper(row))
  val fieldNames = dataset.schema.fieldNames.toList
  // use the caller-supplied keys; fall back to the shardKey (default {_id: 1}) when none are given
  val queryKeyList =
    if (keys.nonEmpty) keys
    else BsonDocument.parse(writeConfig.shardKey.getOrElse("{_id: 1}")).keySet().asScala.toList
  // the rest remains the same
  // ...
}
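A hypothetical call to the patched method might look like this, upserting on the date and name fields; the WriteConfig options below are placeholders.

// Sketch: upsert rows of `df`, matching existing documents on "date" and "name".
// The uri and collection values are placeholders.
val writeConfig = WriteConfig(Map(
  "uri" -> "mongodb://localhost:27017/mydb",
  "collection" -> "target_collection"
))
MongoSpark.save(df, writeConfig, List("date", "name"))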
Can anyone please explain how to enable Spark's permissive mode in the MongoDB Spark connector, i.e. replace corrupt fields with null?
Example:
I have a Mongo collection with 2 records, with the following structure for each of them:
Record 1:
_id -> String
num -> Int32
Record 2:
_id -> String
num -> String
I explicitly pass a schema on the Mongo Spark read using the following command:
val schema: StructType = new StructType()
  .add("_id", StringType, true)
  .add("num", IntegerType, true)

val df = MongoSpark.read(spark)
  .option("uri", uri)
  .option("database", database)
  .option("collection", collection)
  .schema(schema)
  .load()
I get a MongoTypeConversionException because the 'num' field in the second record is a string. I want the Mongo Spark connector to replace corrupt fields with null so that the Spark read succeeds.
When learning Spark SQL, I've been using the following approach to register a collection into the Spark SQL catalog and query it.
val persons: Seq[MongoPerson] = Seq(MongoPerson("John", "Doe"))

sqlContext.createDataset(persons)
  .write
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("collection", "peeps")
  .mode("append")
  .save()

sqlContext.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("collection", "peeps")
  .load()
  .as[Peeps]
  .show()
However, when querying it, it seems that I need to register it as a temporary view in order to access it using SparkSQL.
val readConfig = ReadConfig(Map("uri" -> "mongodb://localhost:37017/test", "collection" -> "morepeeps"), Some(ReadConfig(spark)))
val people: DataFrame = MongoSpark.load[Peeps](spark, readConfig)
people.show()
people.createOrReplaceTempView("peeps")
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
sqlContext.sql("SELECT * FROM peeps")
  .as[Peeps]
  .show()
For a database with quite a few collections, is there a way to hydrate the Spark SQL schema catalog so that this op isn't so verbose?
So there are a couple of things going on. First of all, simply loading the Dataset using sqlContext.read will not register it with the SparkSQL catalog. The end of the function chain in your first code sample returns a Dataset at .as[Peeps]. You need to tell Spark that you want to use it as a view.
Depending on what you're doing with it, I might recommend leaning on the Scala Dataset API rather than SparkSQL. However, if SparkSQL is absolutely essential, you can likely speed things up programmatically.
In my experience, you'll need to run that boilerplate on each table you want to import. Fortunately, Scala is a proper programming language, so we can cut down on code duplication substantially by using a function, and calling it as such:
val MongoDbUri: String = "mongodb://localhost:37017/test" // store this as a constant somewhere

// T must be passed in as some case class
// Note: you can also add a second parameter to change the view name if so desired
def loadTableAsView[T <: Product : TypeTag](table: String)(implicit spark: SparkSession): Dataset[T] = {
  import spark.implicits._ // needed for the .as[T] encoder
  val configMap = Map(
    "uri" -> MongoDbUri,
    "collection" -> table
  )
  val readConfig = ReadConfig(configMap, Some(ReadConfig(spark)))
  val df: DataFrame = MongoSpark.load[T](spark, readConfig)
  df.createOrReplaceTempView(table)
  df.as[T]
}
And to call it:
// Note: if your SparkSession is declared as an implicit val, you won't need to pass it explicitly
val peepsDS: Dataset[Peeps] = loadTableAsView[Peeps]("peeps")(spark)
val chocolatesDS: Dataset[Chocolates] = loadTableAsView[Chocolates]("chocolates")(spark)
val candiesDS: Dataset[Candies] = loadTableAsView[Candies]("candies")(spark)
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
peepsDS.show()
chocolatesDS.show()
candiesDS.show()
This will substantially cut down your boilerplate, and also allow you to more easily write some tests for that repeated bit of code. There's also probably a way to create a map of table names to case classes that you can then iterate over, but I don't have an IDE handy to test it out.
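As a rough, untested sketch of that idea (the collection names below are just examples): if you only need the views registered, and not the typed Datasets, you could skip the case classes entirely and iterate over the collection names.

// Sketch: register several collections as temp views by name only.
// The collection names are placeholders.
val tables = Seq("peeps", "chocolates", "candies")
tables.foreach { table =>
  val readConfig = ReadConfig(Map("uri" -> MongoDbUri, "collection" -> table), Some(ReadConfig(spark)))
  MongoSpark.load(spark, readConfig).createOrReplaceTempView(table)
}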
I am able to read the data stored in MongoDB via Apache Spark using the conventional methods described in its documentation. I have a MongoDB query that I would like to be used while loading the collection. The query is simple, but I can't seem to find the correct way to specify it via the config() function on the SparkSession object.
Following is my SparkSession builder
val confMap: Map[String, String] = Map(
  "spark.mongodb.input.uri" -> "mongodb://xxx:xxx@mongodb1:27017,mongodb2:27017,mongodb3:27017/?ssl=true&replicaSet=MongoShard-0&authSource=xxx&retryWrites=true&authMechanism=SCRAM-SHA-1",
  "spark.mongodb.input.database" -> "A",
  "spark.mongodb.input.collection" -> "people",
  "spark.mongodb.output.database" -> "B",
  "spark.mongodb.output.collection" -> "result",
  "spark.mongodb.input.readPreference.name" -> "primaryPreferred"
)

val conf = new SparkConf()
conf.setAll(confMap)

val spark: SparkSession =
  SparkSession.builder().master("local[1]").config(conf).getOrCreate()
Is there a way to specify the MongoDB query in the SparkConf object so that the SparkSession reads only the specific fields present in the collection?
Use the .withPipeline API.
Example code:
val readConfig = ReadConfig(Map("uri" -> MONGO_DEV_URI, "collection" -> MONGO_COLLECTION_NAME, "readPreference.name" -> "secondaryPreferred"))
MongoSpark
  .load(spark.sparkContext, readConfig)
  .withPipeline(Seq(Document.parse(query)))
As per the comments, the pipeline can also be passed as an option:
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", "[{ $match: { name: { $exists: true } } }]")
  .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
  .load()
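Since the question asks about reading only specific fields, a $project stage in the pipeline should likely work as well; a sketch, with placeholder field names:

// Sketch: push a projection down to MongoDB so only the listed fields are returned.
// "name" and "age" are placeholder field names.
sparkSession.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("pipeline", "[{ $project: { name: 1, age: 1 } }]")
  .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
  .load()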
We are dealing with schema-free JSON data, and sometimes our Spark jobs fail because some of the columns we refer to in Spark SQL are not available for certain hours of the day. During these hours the Spark job fails because the referenced column is not present in the data frame. How can I handle this scenario? I have tried a UDF, but we have too many missing columns, so I can't really check each and every column for availability. I have also tried inferring a schema on a larger data set and applying it to the data frame, expecting that missing columns would be filled with null, but the schema application fails with weird errors.
Please suggest a solution.
This worked for me. I created a function that checks all expected columns and adds any missing ones to the dataframe:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    // add the column as a null string if it is not already present
    if (!df.columns.contains(column)) df.withColumn(column, lit(null).cast(StringType))
    else df
  }
}
val expectedColumns = List("newcol1","newcol2","newcol3")
val finalDf = checkAvailableColumns(castedDateSessions,expectedColumns)
Here is an improved version of the answer @rads provided:
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}

@tailrec
def addMissingFields(fields: List[String])(df: DataFrame): DataFrame = {
  // add a missing top-level column as a null string
  def addMissingField(field: String)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(StringType))

  // add a missing struct column containing a single null string field
  def addMissingStructField(field: String, schema: StructType)(df: DataFrame): DataFrame =
    df.withColumn(field, lit(null).cast(schema))

  fields match {
    case Nil =>
      df
    case c :: cs if c.contains(".") && !df.columns.contains(c.split('.')(0)) =>
      val parts = c.split('.')
      // it only supports one level of nesting, but it can be extended
      val schema = StructType(Array(StructField(parts(1), StringType)))
      addMissingFields(cs)(addMissingStructField(parts(0), schema)(df))
    case c :: cs if !df.columns.contains(c.split('.')(0)) =>
      addMissingFields(cs)(addMissingField(c)(df))
    case _ :: cs =>
      addMissingFields(cs)(df)
  }
}
Now you can use it as a transformation:
val df = ...
val expectedColumns = List("newcol1","newcol2","newcol3")
df.transform(addMissingFields(expectedColumns))
I haven't tested it in production yet to see if there is any performance issue. I doubt it. But if there was any, I'll update my post.
Here are the steps to add missing columns:
val spark = SparkSession
  .builder()
  .appName("Spark SQL json example")
  .master("local[1]")
  .getOrCreate()

import spark.implicits._

val df = spark.read.json(pathToJson) // pathToJson is a placeholder for your input path
val schema = df.schema
val columns = df.columns // enough for flat tables
You can traverse the auto-generated schema. If it is a flat table, just use df.columns. Compare the found columns to the expected columns and add the missing fields like this:
val dataframe2 = df.withColumn("MissingString1", lit(null).cast(StringType))
  .withColumn("MissingString2", lit(null).cast(StringType))
  .withColumn("MissingDouble1", lit(0.0).cast(DoubleType))
Maybe there is a faster way to add the missing columns in one operation, instead of one by one, but the withColumns() method which does that is private.
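For what it's worth, a single select can likely achieve the same effect without the private withColumns() method; a sketch, assuming a hypothetical expectedColumns map of required column names to types:

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{DoubleType, StringType}

// Sketch: build all the missing columns up front and add them in one select.
// `expectedColumns` is a hypothetical map of required column names to their types.
val expectedColumns = Map(
  "MissingString1" -> StringType,
  "MissingString2" -> StringType,
  "MissingDouble1" -> DoubleType
)
val missing = expectedColumns.filter { case (name, _) => !df.columns.contains(name) }
val dataframe3 = df.select(
  df.columns.map(col) ++ missing.map { case (name, dataType) => lit(null).cast(dataType).as(name) }: _*
)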
Here's a pyspark solution based on this answer. It checks against a list of column names (parameterColumnsToKeepList, derived from a configDf transformed into the list of columns the dataframe should have). It assumes all missing columns are ints, but you could look the type up in configDf dynamically too. My default is null, but you could also use 0.
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType

for column in parameterColumnsToKeepList:
    if column not in processedAllParametersDf.columns:
        print('Json missing column: {0}'.format(column))
        processedAllParametersDf = processedAllParametersDf.withColumn(column, lit(None).cast(IntegerType()))
I am using "mongo-spark" to read MongoDB from a Spark 2.0 application.
(https://github.com/mongodb/mongo-spark)
Here is a code example:
val readConfig: ReadConfig = ReadConfig(Map(
  "spark.mongodb.input.uri" -> "mongodb://mongodb01.blabla.com/xqwer",
  "collection" -> "some_collection"),
  None)

sparkSession.read.format("com.mongodb.spark.sql").options(readConfig.asOptions).load()
Does anyone know how to add a MongoDB query (e.g. find({ uid: 'ZesSZY3Ch0k8nQtQUIfH' }))?
You can use filter() on the DataFrame:
val df = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .load()

df.filter($"uid".equalTo(lit("ZesSZY3Ch0k8nQtQUIfH"))).show()
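Alternatively, you can push the query down to MongoDB as an aggregation pipeline via withPipeline (as in the earlier example), so that only matching documents are shipped to Spark; a sketch reusing the readConfig above:

// Sketch: push the equivalent $match down to MongoDB as an aggregation pipeline stage.
MongoSpark.load(sparkSession.sparkContext, readConfig)
  .withPipeline(Seq(Document.parse("{ $match: { uid: 'ZesSZY3Ch0k8nQtQUIfH' } }")))
  .toDF()
  .show()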