Dedupe MongoDB Collection

I'm new to NoSQL, so sorry if this is very basic. Let's say I have the following collection:
{
a: 1,
b: 2,
c: 'x'
},
{
a: 1,
b: 2,
c: 'y'
},
{
a: 1,
b: 1,
c: 'y'
}
I would like to run a "Dedupe" query on anything that matches:
{
a: 1,
b: 2
... (any other properties are ignored) ...
},
So after the query is run, either of the following remaining in the collection would be fine:
{
a: 1,
b: 2,
c: 'y'
},
{
a: 1,
b: 1,
c: 'y'
}
OR
{
a: 1,
b: 2,
c: 'x'
},
{
a: 1,
b: 1,
c: 'y'
}
Just so long as there's only one document with a==1 and b==2 remaining.

If you always want to ensure that only one document has any given a, b combination, you can use a unique index on a and b. When creating the index, you can give the dropDups option, which will remove all but one duplicate:
db.collection.ensureIndex({a: 1, b: 1}, {unique: true, dropDups: true})
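Note that the dropDups option was removed in MongoDB 3.0 (and ensureIndex was deprecated in favor of createIndex), so on a modern server you have to delete the duplicates yourself before creating the unique index. A rough shell sketch, keeping an arbitrary surviving document per a, b pair:
db.collection.aggregate([
  { $group: { _id: { a: "$a", b: "$b" }, ids: { $push: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
]).forEach(function (group) {
  // Keep the first _id in each duplicate group and delete the rest.
  db.collection.deleteMany({ _id: { $in: group.ids.slice(1) } });
});
db.collection.createIndex({ a: 1, b: 1 }, { unique: true });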

This answer hasn't been updated in a while. It took me a while to figure this out. First, using the Mongo CLI, connect to the database and create an index on the field which you want to be unique. Here is an example for users with a unique email address:
db.users.createIndex({ "email": 1 }, { unique: true })
The 1 specifies an ascending index on email; this index exists alongside the _id index that MongoDB creates automatically.
Now when you run create or save on an object, if that email already exists, Mongoose will throw a duplicate key error.
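For reference, a hypothetical Node/Mongoose snippet (the model and field names are made up) showing the duplicate key error surfacing on the second insert:
const mongoose = require('mongoose');
const User = mongoose.model('User', new mongoose.Schema({ email: String }));

async function demo() {
  await User.create({ email: 'a@example.com' });
  try {
    // Violates the unique index created on email above.
    await User.create({ email: 'a@example.com' });
  } catch (err) {
    console.log(err.code); // 11000, MongoDB's duplicate key error code
  }
}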

I don't know of any commands that will update your collection in-place, but you can certainly do it via temp storage.
1. Group your documents by your criteria (fields a and b).
2. For each group, pick any one document, save it to a temp collection tmp, and discard the rest of the group.
3. Overwrite the original collection with the documents from tmp.
You can do this with MapReduce or upcoming Aggregation Framework (currently in unstable branch).
I decided not to write code here as it would take the joy of learning away from you. :)

Related

Using Spark, is there a way to bulk unset a field in Mongo documents?

I have a Scala Spark application in which I would like to unset a field on all documents in a Mongo collection before I load updated data into the collection.
Let's say I have a data source like this and I want to remove the "rank" field from all documents (some may have this field and some may not).
[
{
"_id": 123,
"value": "a"
},
{
"_id": 234,
"value": "b",
"rank": 1
},
...
]
I know Mongo has an $unset update operator, but I don't see any documentation in the mongo spark connector on how to do something like this with Spark.
I've tried filtering out the field and dropping it in the Dataset before I save to Mongo, but I run into the following error:
com.mongodb.MongoBulkWriteException: Bulk write operation error on server localhost:58200. Write errors: [BulkWriteError{index=0, code=9, message=''$set' is empty. You must specify a field like so: {$set: {<field>: ...}}', details={}}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:173)
...
I have the following definitions:
case class Item(_id: Int, rank: Option[Int])
val idCol = new ColumnName("_id")
val rankCol = new ColumnName("rank")
and a function that does something like this in the same class:
def resetRanks(): Unit = {
  val records = MongoSpark
    .load[Item](
      sparkSession,
      ReadConfig(
        Map(
          "collection" -> mongoConfig.collection,
          "database" -> mongoConfig.db,
          "uri" -> mongoConfig.uri
        ),
        Some(ReadConfig(sparkSession))
      )
    )
    .select(idCol, rankCol)
    .repartition(sparkConfig.partitionSize, $"_id")
    .where(rankCol.isNotNull)
    .drop(rankCol)

  MongoSpark.save(
    records,
    WriteConfig(
      Map(
        "collection" -> mongoConfig.collection,
        "database" -> mongoConfig.db,
        "forceInsert" -> "false",
        "ordered" -> "true",
        "replaceDocument" -> "false", // not replacing docs since there are other fields I'd like to keep intact that I won't be modifying
        "uri" -> mongoConfig.uri,
        "writeConcern.w" -> "majority"
      ),
      Some(WriteConfig(sparkSession))
    )
  )
}
I'm using MongoSparkConnector v2.4.2.
I also saw this thread, which seemed to suggest the reason I get the above error is that I can't have null fields, but I need to unset these fields, so I'm at a loss on how to go about it.
Any tips or pointers are appreciated.
You can try something like this, where you drop the column from the dataframe and write to a new collection. One issue I observed here is that when I tried to write back to the same collection, my collection was getting dropped; perhaps you can take the research from there.
Here I am directly using the DataFrameWriter save function. You can use the conventional MongoSpark.save() function along with a WriteConfig if you prefer.
I am using Spark 3.1.2, Mongo-Spark Connector 3.0.1, Mongo 4.2.6
case class Item(id: Int, rank: Option[Int], value: String = "abc")

def main(args: Array[String]): Unit = {
  val sparkSession = getSparkSession(args)
  val items = MongoSpark.load[Item](sparkSession, ReadConfig(Map("collection" -> "items"), Some(ReadConfig(sparkSession))))
  items.show()
  val dropped = items.drop("rank")
  dropped.write.option("collection", "items-updated").mode("overwrite").format("mongo").save()
  dropped.show()
}
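If the goal is just a one-off bulk $unset rather than rewriting documents through the Dataset API, another route is to drop down to the underlying Java driver via the connector's MongoConnector. This isn't covered in the connector documentation as far as I know, so treat the following as a sketch to verify against your connector version; mongoConfig and sparkSession are the asker's objects:
import com.mongodb.client.MongoCollection
import com.mongodb.client.model.Updates
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.SparkSession
import org.bson.Document

// Issue a single server-side updateMany({}, {$unset: {rank: ""}}) instead of
// round-tripping the documents through Spark.
def unsetRankDirectly(sparkSession: SparkSession): Unit = {
  val writeConfig = WriteConfig(
    Map(
      "collection" -> mongoConfig.collection,
      "database" -> mongoConfig.db,
      "uri" -> mongoConfig.uri
    ),
    Some(WriteConfig(sparkSession))
  )
  MongoConnector(writeConfig.asOptions).withCollectionDo(
    writeConfig,
    (coll: MongoCollection[Document]) => coll.updateMany(new Document(), Updates.unset("rank"))
  )
}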

What is the difference between the + operator and std.mergePatch in Jsonnet?

Jsonnet's std.mergePatch implements RFC 7396, but in my naive testing I didn't find a difference between the way it behaved and the + operator; e.g. the + operator respects the x+: field syntax. std.mergePatch is implemented in Jsonnet itself, which seems to imply that it is different from the + operator, which I'm assuming is a builtin.
What is different about the semantics of these two ways of merging?
Jsonnet's + and std.mergePatch are completely different operations. The + operator operates only on a single level and std.mergePatch traverses the object recursively and merges the nested objects. It's easiest to explain with an example:
local foo = { a: {b1: {c1: 42}}},
bar = { a: {b2: {c2: 2}}};
foo + bar
Output:
{
"a": {
"b2": {
"c2": 2
}
}
}
Note that bar.a completely replaced foo.a. With +, all fields in the second object override the fields in the first object. Compare that with the result of using std.mergePatch(foo, bar):
{
"a": {
"b1": {
"c1": 42
},
"b2": {
"c2": 2
}
}
}
Since both foo and bar have a field a, it is merged and the final result contains both b1 and b2.
So to reiterate, + is a "flat" operation which overrides the fields of the first object with the fields of the second object.
This is not the end of the story, though. You mentioned the field+: value syntax, and I will try to explain what it really does. In Jsonnet, + is not just overwriting but inheritance in the OO sense: it creates an object which is the result of the second object inheriting from the first one. It's a bit exotic to have an operator for that – in all mainstream languages such relationships are statically defined. In Jsonnet, when you do foo + bar, the bar object has access to stuff from foo through super:
{ a: 2 } + { a_plus_1: super.a + 1}
This results in:
{
"a": 2,
"a_plus_1": 3
}
You can use this feature to merge the fields deeper inside:
{ a: {b: {c1: 1}, d: 1}} +
{ a: super.a + {b: {c2: 2} } }
Resulting in:
{
"a": {
"b": {
"c2": 2
},
"d": 1
}
}
This is a bit repetitive, though (it would be annoying if the field name was longer). So we have a nice syntax sugar for that:
{ a: {b: {c1: 1} , d: 1}} +
{ a+: {b: {c2: 2}} }
Please note that in these examples we only did the merging for one particular field we chose. We still replaced the value of a.b. This is much more flexible, because in many cases you can't just naively merge all stuff inside (sometimes a nested object is "atomic" and should be replaced completely).
The +: version works in the same way as the version with super. The subtle difference is that +: actually translates to something like if field in super then super.field + val else val, so it also returns the same value when super is not provided at all or doesn't have this particular field. For example, {a +: {b: 42}} evaluates just fine to {a: { b: 42 }}.
Mandatory sermon: while + is very powerful, please don't abuse it. Consider using functions instead of inheritance when you need to parameterize something.
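As a small illustration of that advice (the names below are made up), parameterizing with a function instead of overriding with + could look like this:
local server(port) = {
  host: "localhost",
  port: port,
};
{
  // Explicit parameterization, no inheritance or super involved.
  web: server(8080),
  admin: server(9090),
}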

How to use 'groupBy' method on a list of Int, & Strings

I managed to group a List of Strings by length:
List("Every", "student", "likes", "Scala").groupBy((element: String) => element.length)
I want to group a tuple, e.g.:
("Every", "student", "likes", "Scala", 1, 5, 54, 0, 1, 0)
The groupBy method takes a discriminator function as its parameter and uses it to group the elements by the key that function returns, producing a Map from keys to the matching values.
As per the Scala documentation, the definition of the groupBy method is as follows:
groupBy[K](f: (A) ⇒ K): immutable.Map[K, Repr]
Hence, assuming that you have a tuple of Ints and Strings and you want to group the Strings, I would perform the following steps:
1. Create a list from the tuple
2. Filter out types other than String
3. Apply groupBy on the list
The code for this is as follows:
("Every", "student", "likes", "Scala", 1, 5, 54, 0, 1, 0)
.productIterator.toList
.filter(_.isInstanceOf[String])
.map(_.asInstanceOf[String])
.groupBy((element: String) => element.length)
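For reference, evaluating that expression yields a result along these lines (the iteration order of the Map entries may differ):
Map(5 -> List(Every, likes, Scala), 7 -> List(student))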

How to use nested JSON in schema

Trying to set up a nested schema as follows. I want to be able to reject the input if bb.c is present when aa.a is present.
I've tried without() as well as xor().
{
  Joi.object().keys({
    aa: Joi.object().keys({
      a: Joi.string(),
      b: Joi.string()
    }).unknown(true).with("a", "b"),
    bb: Joi.object().keys({
      c: Joi.string()
    }).unknown(true)
  }).xor("aa.a", ["bb.c"])
}
With the object below, xor fails with ValidationError: "value" must contain at least one of [aa.a, bb.c], yet aa.a exists in the supplied values:
{
"aa": {
"a": "fg",
"b": "fg"
},
"bb": {
"c": "l"
}
}
If I try
.without( "aa.a" , ["bb.c"])
then the schema passes, although in my mind it should not, as without should fail when bb.c is present along with aa.a.
Is it because the two things are nested in other objects perhaps?
Can we not specify deeply linked stuff like this?
Thanks in advance
You can use Joi.when() and create a schema like this:
Joi.object({
  aa: Joi.object().keys({
    a: Joi.string(),
    b: Joi.string()
  }).unknown(true).with('a', 'b'),
  bb: Joi.object().keys({
    c: Joi.string()
  }).unknown(true)
    .when('aa.a', {
      is: Joi.exist(),
      then: Joi.object({ c: Joi.forbidden() })
    })
});
Basically what this does is: if aa.a is present, then bb.c is not allowed and the validation will fail. With this schema your example will fail as you expect.
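For example, assuming the schema above is assigned to a variable named schema, validating the asker's sample object now fails (on recent Joi versions; older versions use Joi.validate(value, schema) instead):
const { error } = schema.validate({
  aa: { a: 'fg', b: 'fg' },
  bb: { c: 'l' }
});
// error is defined, with a message along the lines of '"bb.c" is not allowed'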

Joi: Require exactly two of three fields to be non-empty

Here is a simple version of my schema.
var schema = Joi.object().keys({
  a: Joi.string(),
  b: Joi.string(),
  c: Joi.string()
});
I want a, b, c to be exactly 2 out of 3 non-empty. I.e.:
if a, b are not empty, c should not be set
idem with circular permutation of a,b,c
of course, if 2 or more are empty, throw an error too
Tried using .or() but obviously doesn't do the trick. Looked into .alternatives() but didn't get it working.
It's tricky to find an elegant way to handle this without stumbling into circular dependency issues. I've managed to get something working using .alternatives() and .try().
The solution in its raw form would be this:
Joi.alternatives().try(
  Joi.object().keys({
    a: Joi.string().required(),
    b: Joi.string().required(),
    c: Joi.string().required().valid('')
  }),
  Joi.object().keys({
    a: Joi.string().required().valid(''),
    b: Joi.string().required(),
    c: Joi.string().required()
  }),
  Joi.object().keys({
    a: Joi.string().required(),
    b: Joi.string().required().valid(''),
    c: Joi.string().required()
  })
);
It's certainly not pretty and could get pretty bloated if any more dependencies are introduced.
In an attempt to reduce the amount of repetition, the following would also work:
var base = {
  a: Joi.string().required(),
  b: Joi.string().required(),
  c: Joi.string().required()
};
Joi.alternatives().try(
  Joi.object().keys(Object.assign({}, base, {
    a: base.a.valid('')
  })),
  Joi.object().keys(Object.assign({}, base, {
    b: base.b.valid('')
  })),
  Joi.object().keys(Object.assign({}, base, {
    c: base.c.valid('')
  }))
);
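As a quick sanity check of either variant (assuming the alternatives schema is assigned to schema; on older Joi versions use Joi.validate(value, schema) instead of schema.validate(value)):
schema.validate({ a: 'x', b: 'y', c: '' });  // passes: exactly two non-empty, the third present but empty
schema.validate({ a: 'x', b: 'y', c: 'z' }); // fails: no alternative allows all three non-empty
schema.validate({ a: 'x', b: '', c: '' });   // fails: only one field is non-empty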