Using Spark, is there a way to bulk unset a field in Mongo documents? - mongodb

I have a Scala Spark application in which I would like to unset a field on all documents in a Mongo collection before I load updated data into the collection.
Let's say I have a data source like this and I want to remove the "rank" field from all documents (some may have this field and some may not).
[
{
"_id": 123,
"value": "a"
},
{
"_id": 234,
"value": "b",
"rank": 1
},
...
]
I know Mongo has an $unset operator, but I don't see anything in the Mongo Spark connector documentation about how to do something like this with Spark.
I've tried filtering out the field and dropping it in the Dataset before I save to Mongo, but I run into the following error:
com.mongodb.MongoBulkWriteException: Bulk write operation error on server localhost:58200. Write errors: [BulkWriteError{index=0, code=9, message=''$set' is empty. You must specify a field like so: {$set: {<field>: ...}}', details={}}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:173)
...
I have the following definitions:
case class Item(_id: Int, rank: Option[Int])
val idCol = new ColumnName("_id")
val rankCol = new ColumnName("rank")
and a function that does something like this in the same class:
def resetRanks(): Unit = {
  val records = MongoSpark
    .load[Item](
      sparkSession,
      ReadConfig(
        Map(
          "collection" -> mongoConfig.collection,
          "database" -> mongoConfig.db,
          "uri" -> mongoConfig.uri
        ),
        Some(ReadConfig(sparkSession))
      )
    )
    .select(idCol, rankCol)
    .repartition(sparkConfig.partitionSize, $"_id")
    .where(rankCol.isNotNull)
    .drop(rankCol)
  MongoSpark.save(
    records,
    WriteConfig(
      Map(
        "collection" -> mongoConfig.collection,
        "database" -> mongoConfig.db,
        "forceInsert" -> "false",
        "ordered" -> "true",
        "replaceDocument" -> "false", // not replacing docs since there are other fields I'd like to keep intact that I won't be modifying
        "uri" -> mongoConfig.uri,
        "writeConcern.w" -> "majority"
      ),
      Some(WriteConfig(sparkSession))
    )
  )
}
I'm using MongoSparkConnector v2.4.2.
I also saw this thread, which seemed to suggest the reason I get the above error is that I can't have null fields, but I need to unset these fields, so I'm at a loss on how to go about it.
Any tips or pointers are appreciated.

You can try something like this, where you drop the column from the DataFrame and write to a new collection. One issue I have observed here: when trying to write back to the same collection, my collection was getting dropped. Perhaps you can take the research from there.
Here I am directly using the DataFrameWriter save function. You can use the conventional MongoSpark.save() function along with a WriteConfig as you like.
I am using Spark 3.1.2, Mongo-Spark Connector 3.0.1, Mongo 4.2.6.
case class Item(id: Int, rank: Option[Int], value: String = "abc")

def main(args: Array[String]): Unit = {
  val sparkSession = getSparkSession(args)
  val items = MongoSpark.load[Item](sparkSession, ReadConfig(Map("collection" -> "items"), Some(ReadConfig(sparkSession))))
  items.show()
  val dropped = items.drop("rank")
  dropped.write.option("collection", "items-updated").mode("overwrite").format("mongo").save()
  dropped.show()
}
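If you need to keep the existing documents and only strip the field in place, note that dropping the column and re-saving never sends a $unset; the in-place equivalent is a single bulk update such as `db.items.updateMany({}, { $unset: { rank: "" } })`, run in the shell or through the plain driver rather than the connector. As a minimal sketch of the modifier shape only (the `unsetModifier` helper is hypothetical and just builds the JSON, it does not talk to Mongo):

```scala
// Build the JSON shape of a bulk $unset modifier for one field.
// $unset removes the field from matching documents and is a no-op
// on documents that do not have the field.
def unsetModifier(field: String): String =
  s"""{"$$unset": {"$field": ""}}"""

val mod = unsetModifier("rank")
```

Sending `mod` with an empty filter through `updateMany` would clear the field on every document without touching the rest of each document.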

Related

UDF in Spark works very slow

I have a UDF in Spark (running on EMR), written in Scala, that parses the device from the user agent using the uaparser library for Scala (uap-scala). It works fine on small sets (5000 rows), but on larger sets (2M) it is very slow.
I tried collecting the DataFrame to a list and looping over it on the driver, and that was also very slow, which makes me believe the UDF runs on the driver and not on the workers.
How can I verify this? Does anyone have another theory?
If that is the case, why does this happen?
This is the udf code:
def calcDevice(userAgent: String): String = {
  val userAgentVal = Option(userAgent).getOrElse("")
  Parser.get.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
usage:
.withColumn("agentDevice", udfDefinitions.calcDeviceValUDF($"userAgent"))
Thanks
Nir
The problem was with instantiating the parser within the UDF itself. The solution is to create the object outside the UDF and use it at row level:
val userAgentAnalyzerUAParser = Parser.get
def calcDevice(userAgent: String): String = {
  val userAgentVal = Option(userAgent).getOrElse("")
  userAgentAnalyzerUAParser.parse(userAgentVal).device.family
}

val calcDeviceValUDF: UserDefinedFunction = udf(calcDevice _)
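The cost difference is easy to demonstrate without Spark or uap-scala; here is a hypothetical stand-in parser that counts how many times it is constructed, comparing per-row construction with a single shared instance:

```scala
// Hypothetical stand-in for an expensive-to-build parser.
var builds = 0
class FakeParser {
  builds += 1 // count each construction
  def device(ua: String): String = ua.takeWhile(_ != '/')
}

val rows = List("Mozilla/5.0", "curl/7.68", "Opera/9.80")

// Anti-pattern: building the parser inside the per-row function,
// as happens when Parser.get runs inside the UDF body.
rows.foreach(r => new FakeParser().device(r))
val perRowBuilds = builds

// The fix above: build once, reuse for every row.
builds = 0
val shared = new FakeParser
val devices = rows.map(shared.device)
```

With millions of rows, paying the construction cost once instead of once per row is the difference the answer describes.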
We ran into the same issue, where Spark jobs were hanging. One additional thing we did was use a broadcast variable. This UDF is actually still quite slow after all the changes, so your mileage may vary. One other caveat is acquiring the SparkSession: we run in Databricks, and if the SparkSession isn't available it will crash; if you need the job to continue, you have to handle that failure case.
object UDFs extends Serializable {
  val uaParser = SparkSession.getActiveSession.map(_.sparkContext.broadcast(CachingParser.default(100000)))
  val parseUserAgent = udf { (userAgent: String) =>
    // We will simply return an empty map if uaParser is None because that would mean
    // there is no active spark session to broadcast the parser.
    //
    // Also if you wrap the potentially null value in an Option and use flatMap and map to
    // add type safety it becomes slower.
    if (userAgent == null || uaParser.isEmpty) {
      Map[String, Map[String, String]]()
    } else {
      val parsed = uaParser.get.value.parse(userAgent)
      Map(
        "browser" -> Map(
          "family" -> parsed.userAgent.family,
          "major" -> parsed.userAgent.major.getOrElse(""),
          "minor" -> parsed.userAgent.minor.getOrElse(""),
          "patch" -> parsed.userAgent.patch.getOrElse("")
        ),
        "os" -> Map(
          "family" -> parsed.os.family,
          "major" -> parsed.os.major.getOrElse(""),
          "minor" -> parsed.os.minor.getOrElse(""),
          "patch" -> parsed.os.patch.getOrElse(""),
          "patch-minor" -> parsed.os.patchMinor.getOrElse("")
        ),
        "device" -> Map(
          "family" -> parsed.device.family,
          "brand" -> parsed.device.brand.getOrElse(""),
          "model" -> parsed.device.model.getOrElse("")
        )
      )
    }
  }
}
You might also want to play with the size of the CachingParser.
Since Parser.get.parse is missing from the question, it is only possible to judge the udf part.
For performance you can remove the Option:
def calcDevice(userAgent: String): String = {
  val userAgentVal = if (userAgent == null) "" else userAgent
  Parser.get.parse(userAgentVal).device.family
}
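Both forms handle null the same way; the null check only avoids allocating an Option per row. A small stand-alone comparison (hypothetical helper names, parser call omitted) shows they agree:

```scala
// Option-based null handling, as in the original UDF.
def viaOption(userAgent: String): String =
  Option(userAgent).getOrElse("")

// Plain null check, avoiding the per-row Option allocation.
def viaNullCheck(userAgent: String): String =
  if (userAgent == null) "" else userAgent
```

Whether the allocation matters in practice depends on the workload; it is a micro-optimization compared to reusing the parser.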

Not persisting Scala None's instead of persisting as null value

I noticed that the Scala driver (version 1.2.1) writes Option values of None as null to the corresponding field. I would prefer omitting the field completely in this case. Is this possible?
Example
case class Test(foo: Option[String])
persist(Test(None))
leads to
> db.test.find()
{ "_id": "...", "foo": null }
but I want to achieve
> db.test.find()
{ "_id": "..." }
When I used casbah, I think my intended behaviour was the default.
http://mongodb.github.io/mongo-scala-driver/2.4/bson/macros/
Now you can use macros for it:
val testCodec = Macros.createCodecProviderIgnoreNone[Test]()
and in codec conf:
lazy val codecRegistry: CodecRegistry = fromRegistries(fromProviders(testCodec))
Opened a feature request in the MongoDB bug tracker (https://jira.mongodb.org/browse/SCALA-294), which was answered by Ross Lawley. He suggests changing the conversion code (from case class to document) from
def toDocument(t: Test) = Document("foo" -> t.foo)
to something like
def toDocument(t: Test) = {
  var d = Document()
  t.foo.foreach { value =>
    d = d + ("foo" -> value) // Document is immutable, so keep the result
  }
  d
}
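Since the driver's Document is immutable, the accumulate-if-present pattern is easy to get wrong. A stand-in sketch using a plain Scala Map (hypothetical, no driver types) shows how an absent Option simply omits the field instead of writing null:

```scala
case class Test(foo: Option[String])

// Fold the Option into the map: None contributes nothing,
// so the field is omitted rather than persisted as null.
def toFields(t: Test): Map[String, String] =
  t.foo.foldLeft(Map.empty[String, String])((m, v) => m + ("foo" -> v))
```

The same fold works for any number of optional fields, chaining one foldLeft per field.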

How to update a document using ReactiveMongo

I get the following list of documents back from MongoDB when I find for "campaignID":"DEMO-1".
[
{
"_id": {
"$oid": "56be0e8b3cf8a2d4f87ddb97"
},
"campaignID": "DEMO-1",
"revision": 1,
"action": [
"kick",
"punch"
],
"transactionID": 20160212095539543
},
{
"_id": {
"$oid": "56c178215886447ea261710f"
},
"transactionID": 20160215000257159,
"campaignID": "DEMO-1",
"revision": 2,
"action": [
"kick"
],
"transactionID": 20160212095539578
}
]
Now, what I am trying to do here is, for a given campaignID, find all its versions (revision in my case) and change the action field to "dead", of type String. I read the docs, but the examples they have are too simple and not helpful in my case. This is what the docs say:
val selector = BSONDocument("name" -> "Jack")

val modifier = BSONDocument(
  "$set" -> BSONDocument(
    "lastName" -> "London",
    "firstName" -> "Jack"),
  "$unset" -> BSONDocument(
    "name" -> 1))

// get a future update
val futureUpdate = collection.update(selector, modifier)
I can't just follow the docs: there it's easy to create a new BSON document and use it as the modifier, following the BSON structure by hardcoding the exact fields. In my case I need to find the documents first and then modify the action field on the fly, because unlike the docs, my action field can have different values.
Here's my code so far which obviously does not compile:
def updateDocument(campaignID: String) = {
  val timeout = scala.concurrent.duration.Duration(5, "seconds")
  val collection = db.collection[BSONCollection](collectionName)
  val selector = BSONDocument("action" -> "dead")
  val modifier = collection.find(BSONDocument("campaignID" -> campaignID)).cursor[BSONDocument]().collect[List]()
  val updatedResults = Await.result(modifier, timeout)
  val mod = BSONDocument(
    "$set" -> updatedResults(0),
    "$unset" -> BSONDocument(
      "action" -> ???))
  val futureUpdate = collection.update(selector, updatedResults(0))
  futureUpdate
}
This worked for me as an answer to my own question. Thanks @cchantep for helping me out.
val collection = db.collection[BSONCollection](collectionName)
val selector = BSONDocument("campaignID" -> campaignID)
val mod = BSONDocument("$set" -> BSONDocument("action" -> "dead"))
val futureUpdate = collection.update(selector, mod, multi = true)
If you have a look at the BSON documentation, you can see that BSONArray can be used to pass a sequence of BSON values.
BSONDocument("action" -> BSONArray("kick", "punch"))
If you have a List[T] as values, with T being provided a BSONWriter[_ <: BSONValue, T], then this list can be converted to a BSONArray.
BSONDocument("action" -> List("kick", "punch"))
// as `String` is provided a `BSONWriter`

Reactive Mongo Extensions: Pass List Of Values In $in Query using `Query DSL`

I am trying to pass multiple values in an $in query using the Query DSL with ReactiveMongo Extensions, but the result is an empty list. The following is my code:
def findUsersByRolesIds(rolesIds: List[BSONObjectID], page: Int, pageSize: Int): Future[List[User]] = {
  logger.info("findUsersByRolesIds Reactive Repository Method")
  userGenericRepo.find($doc("userRoles._id" $in (rolesIds)), $doc("createdOn" -> -1), page, pageSize)
}
When I execute the above code, the result is empty.
But when I pass a single hardcoded id, as below, results are returned.
def findUsersByRolesIds(rolesIds: List[BSONObjectID], page: Int, pageSize: Int): Future[List[User]] = {
  logger.info("findUsersByRolesIds Reactive Repository Method")
  userGenericRepo.find($doc("userRoles._id" $in (BSONObjectID.apply("5548b098b964e7039852ff58"))), $doc("createdOn" -> -1), page, pageSize)
}
The main problem is that I have multiple values, which is why I created the list, but here the list is not working. How is this query possible with ReactiveMongo Extensions and the Query DSL?
$in expects varargs, i.e. val dsl: BSONDocument = "age" $in (1, 2, 3), so you cannot directly pass a collection to it. Try using "age" $in (rolesIds: _*).
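The distinction is plain Scala varargs behavior, not anything specific to the DSL; here is a hypothetical `in` helper standing in for `$in`:

```scala
// Hypothetical varargs helper standing in for the DSL's $in.
def in(values: Int*): Set[Int] = values.toSet

val ids = List(1, 2, 3)
// in(ids) would not compile: a List[Int] is not Int varargs.
val matched = in(ids: _*) // `: _*` splats the collection into varargs
```

The `: _*` type ascription tells the compiler to expand the collection into the varargs parameter, which is exactly what the `rolesIds: _*` fix does.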

How to query with '$in' over '_id' in reactive mongo and play

I have a project set up with playframework 2.2.0 and play2-reactivemongo 0.10.0-SNAPSHOT. I'd like to query for a few documents by their ids, in a fashion similar to this:
def usersCollection = db.collection[JSONCollection]("users")
val ids: List[String] = /* fetched from somewhere else */
val query = ??
val users = usersCollection.find(query).cursor[User].collect[List]()
As a query I tried:
Json.obj("_id" -> Json.obj("$in" -> ids)) // 1
Json.obj("_id.$oid" -> Json.obj("$in" -> ids)) // 2
Json.obj("_id" -> Json.obj("$oid" -> Json.obj("$in" -> ids))) // 3
The first and second return empty lists, and the third fails with the error assertion 10068 invalid operator: $oid.
NOTE: copy of my response on the ReactiveMongo mailing list.
First, sorry for the delay of my answer, I may have missed your question.
Play-ReactiveMongo cannot guess on its own that the values of a Json array are ObjectIds. That's why you have to make a Json object for each id that looks like this: {"$oid": "526fda0f9205b10c00c82e34"}. When the ReactiveMongo Play plugin sees an object whose first field is $oid, it treats it as an ObjectId, so that the driver can send the right type for this value (BSONObjectID in this case).
This is a more general problem actually: the JSON format does not exactly match the BSON one. That's the case for numeric types (BSONInteger, BSONLong, BSONDouble), BSONRegex, BSONDateTime, and BSONObjectID. You can find more detailed information in the MongoDB documentation: http://docs.mongodb.org/manual/reference/mongodb-extended-json/ .
I managed to solve it with:
val objectIds = ids.map(id => Json.obj("$oid" -> id))
val query = Json.obj("_id" -> Json.obj("$in" -> objectIds))
usersCollection.find(query).cursor[User].collect[List]()
since the play-reactivemongo format considers a value a BSONObjectID only when "$oid" is followed by a string:
implicit object BSONObjectIDFormat extends PartialFormat[BSONObjectID] {
  def partialReads: PartialFunction[JsValue, JsResult[BSONObjectID]] = {
    case JsObject(("$oid", JsString(v)) +: Nil) => JsSuccess(BSONObjectID(v))
  }
  val partialWrites: PartialFunction[BSONValue, JsValue] = {
    case oid: BSONObjectID => Json.obj("$oid" -> oid.stringify)
  }
}
Still, I hope there is a cleaner solution. If not, I guess it makes it a nice pull request.
I'm wondering if transforming the ids into BSONObjectIDs isn't safer this way:
val ids: List[String] = ???
val bsonObjectIds = ids.map(BSONObjectID.parse(_)).collect{case Success(t) => t}
this will only generate valid BSONObjectIDs (and discard invalid ones)
If you do it this way:
val objectIds = ids.map(id => Json.obj("$oid" -> id))
your objectIds may not be valid, depending on whether each string id really is the stringified form of a BSONObjectID or not.
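The parse-and-discard pattern above is plain Try/collect; here is an analogous sketch using integer parsing in place of `BSONObjectID.parse` (the input values are hypothetical):

```scala
import scala.util.{Success, Try}

// Parse what you can, silently dropping invalid entries,
// just as the BSONObjectID.parse + collect pattern does.
val raw = List("26", "not-an-id", "42")
val valid = raw.map(s => Try(s.toInt)).collect { case Success(n) => n }
```

If you would rather fail loudly on a bad id than drop it, use `.map(_.get)` instead of `collect` so the first failure throws.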
If you import play.modules.reactivemongo.json._ it works without any $oid formatters.
import play.modules.reactivemongo.json._
...
val ids: Seq[BSONObjectID] = ???
val selector = Json.obj("_id" -> Json.obj("$in" -> ids))
usersCollection.find(selector).cursor[User].collect[Seq]()
I tried with the following and it worked for me:
val listOfItems = BSONArray(51, 61)
val query = BSONDocument("_id" -> BSONDocument("$in" -> listOfItems))
val ruleListFuture = bsonFutureColl.flatMap(
  _.find(query, Option.empty[BSONDocument])
    .cursor[ResponseAccDataBean]()
    .collect[List](-1, Cursor.FailOnError[List[ResponseAccDataBean]]()))