I have to migrate data from this structure:
{
"_id": "some-id",
"_class": "org.some.class",
"number": 1015,
"timestamp": {"$date": "2020-09-05T12:08:02.809Z"},
"cost": 0.9200000166893005
}
to
{"_id": {
"productId": "some-id",
"country": "DE"
},
"_class": "org.some.class",
"number": 1015,
"timestamp": {"$date": "2020-09-05T12:08:02.809Z"},
"cost": 0.9200000166893005
}
The change in the new document is that the plain _id field is replaced by a compound _id object (productId: String, country: String).
The country field is to be filled in with a fixed value - DE - for the entire collection.
The collection has about 40 million records in the old format and 700k in the new format. I would like to bring these 40 million to this new form. I’m using mongo 3.6, so I’m a bit limited and I’ll probably have to use the aggregate functions to create a completely new collection, and then remove the old one.
I would be grateful for help on how to do it - what the query should look like, and how to keep the 700k documents that are already migrated.
What I have got so far:
db.productDetails.aggregate(
{$match: {_id: {$exists: true}}},
{$addFields: {"_id": {"productId": "$_id", "country": "DE"}}},
{$project: {_id: 1, _class: 1, number: 1, timestamp: 1, cost: 1}},
{$out: "productDetailsV2"}
)
but this solution would only work if I didn't have 700k documents in the new form.
Your query is going in the right direction. You may want to modify the $match filter a bit to better catch the old-format documents:
db.collection.aggregate([
{
$match: {
"_id.country": {
$exists: false
}
}
},
{
$addFields: {
"_id": {
"productId": "$_id",
"country": "DE"
}
}
},
{
$project: {
"_id": 1,
"_class": 1,
"number": 1,
"timestamp": 1,
"cost": 1
}
},
{
$out: "productDetailsV2"
}
])
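As a sanity check, the effect of the $match and $addFields stages can be sketched in plain JavaScript; the sample documents below are hypothetical, just mirroring the two shapes in the question:

```javascript
// Plain-JS sketch of what the pipeline does to each document.
// Sample documents are hypothetical, mirroring the shapes in the question.
const docs = [
  { _id: "some-id", _class: "org.some.class", number: 1015, cost: 0.92 },                  // old shape
  { _id: { productId: "other-id", country: "DE" }, _class: "org.some.class", number: 7 }   // new shape
];

// $match: { "_id.country": { $exists: false } } keeps only old-shape docs
const oldDocs = docs.filter(d => !(d._id && typeof d._id === "object" && "country" in d._id));

// $addFields: wrap the scalar _id into the compound key
const migrated = oldDocs.map(d => ({ ...d, _id: { productId: d._id, country: "DE" } }));
```

The already-migrated document passes straight through the $match untouched, which is exactly why filtering on `_id.country` is safer than `_id: {$exists: true}` (which matches every document).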
My collection, userresults, has documents which are unique by userref and sessionref together. A session has a selection of game results in a results array. I have already filtered the results to return those userresults documents which contain a result for game “Clubs”.
[{
"userref": "AAA",
"sessionref": "S1",
"results": [{
"gameref": "Spades",
"dateplayed": ISODate("2022-01-01T10:00:00"),
"score": 1000
}, {
"gameref": "Hearts",
"dateplayed": ISODate("2022-01-02T10:00:00"),
"score": 500
}, {
"gameref": "Clubs",
"dateplayed": ISODate("2022-01-05T10:00:00"),
"score": 200
}]
}, {
"userref": "AAA",
"sessionref": "S2",
"results": [{
"gameref": "Spades",
"dateplayed": ISODate("2022-02-02T10:00:00"),
"score": 1000
}, {
"gameref": "Clubs",
"dateplayed": ISODate("2022-05-02T10:00:00"),
"score": 200
}]
}, {
"userref": "BBB",
"sessionref": "S1",
"results": [{
"gameref": "Clubs",
"dateplayed": ISODate("2022-01-05T10:00:00"),
"score": 200
}]
}]
What I need to do within my aggregation is select the userresult document FOR EACH USER that contains the most recently played game of Clubs, i.e. in this case it will return the AAA/S2 document and the BBB/S1 document.
I’m guessing I need a group on the userref as a starting point, but then how do I select the rest of the document based on the most recent Clubs date?
Thanks!
If I've understood correctly, you can try this aggregation pipeline:
First, $filter is used to avoid unwinding every element in the collection. It keeps only the objects in the results array whose gameref is Clubs.
The next stage is the $unwind, but it now operates only on the remaining Clubs entries rather than on every result. Note that $unwind drops documents whose filtered array is empty, so documents without any "gameref": "Clubs" entry do not reach the next stage.
Then $sort the remaining results by dateplayed so the most recent date comes first.
Finally, $group with $first picks out the data you want: since the documents are sorted by dateplayed, $first returns the most recent Clubs result per user.
db.collection.aggregate([
{
"$set": {
"results": {
"$filter": {
"input": "$results",
"cond": {
"$eq": [
"$$this.gameref",
"Clubs"
]
}
}
}
}
},
{
"$unwind": "$results"
},
{
"$sort": {
"results.dateplayed": -1
}
},
{
"$group": {
"_id": "$userref",
"results": {
"$first": "$results"
},
"sessionref": {
"$first": "$sessionref"
}
}
}
])
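The stage-by-stage logic can also be sketched in plain JavaScript with the sample documents from the question (trimmed for brevity), to show why $first after the descending sort picks the right session:

```javascript
// Plain-JS sketch of the $filter/$unwind/$sort/$group pipeline,
// using abbreviated sample documents from the question.
const docs = [
  { userref: "AAA", sessionref: "S1", results: [
      { gameref: "Spades", dateplayed: new Date("2022-01-01T10:00:00Z"), score: 1000 },
      { gameref: "Clubs",  dateplayed: new Date("2022-01-05T10:00:00Z"), score: 200 } ] },
  { userref: "AAA", sessionref: "S2", results: [
      { gameref: "Clubs",  dateplayed: new Date("2022-05-02T10:00:00Z"), score: 200 } ] },
  { userref: "BBB", sessionref: "S1", results: [
      { gameref: "Clubs",  dateplayed: new Date("2022-01-05T10:00:00Z"), score: 200 } ] }
];

// $filter + $unwind: one row per Clubs result
const rows = docs.flatMap(d =>
  d.results.filter(r => r.gameref === "Clubs")
           .map(r => ({ userref: d.userref, sessionref: d.sessionref, results: r })));

// $sort: { "results.dateplayed": -1 }
rows.sort((a, b) => b.results.dateplayed - a.results.dateplayed);

// $group by userref; $first keeps the most recent row per user
const latest = {};
for (const r of rows) if (!(r.userref in latest)) latest[r.userref] = r;
```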
How can I get the distinct values of all the fields within a MongoDB collection using a single query?
{ "_id": "1", "Gender": "male", "car": "bmw" , "house":"2bhk" , "married_to": "kalpu"},
{ "_id": "2", "Gender": "female", "car": null , "house":"3bhk", "married_to": "kalpu"},
{ "_id": "3", "Gender": "female", "car": "audi", "house":"1bhk", "married_to": "deepa"},
This is an example with a few fields; in my actual collection, each document has at least 50 fields. How can I efficiently query for the unique values within each of the fields? Thanks in advance for the help.
Answer expected:
for each field,
Gender:"male", "female"
car :"bmw", "audi",.....
house : "3hbk","2bhk","1bhk"
married_to: "kalpu","deepa",....
....
....
...
You can use an aggregation pipeline $group stage with the $addToSet operator:
db.collection.aggregate([
{
$group: {
_id: null,
Gender: {
"$addToSet": "$Gender"
},
car: {
"$addToSet": "$car"
},
house: {
"$addToSet": "$house"
},
married_to: {
"$addToSet": "$married_to"
},
}
}
])
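Since the real documents have around 50 fields, the $group document itself can be generated from a list of field names rather than written out by hand. A sketch (the field list here is just the example's; substitute your own):

```javascript
// Build the $group stage from a list of field names instead of writing
// fifty $addToSet entries manually. The field list is illustrative.
const fields = ["Gender", "car", "house", "married_to"];

const groupStage = {
  $group: fields.reduce(
    (acc, f) => ({ ...acc, [f]: { $addToSet: "$" + f } }),
    { _id: null }
  )
};
// groupStage can then be passed as a pipeline stage:
// db.collection.aggregate([groupStage])
```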
I need to get the latest documents that are in an array of ids, based on date/time. I have the following query that does this, but it only returns the _id and acquiredTime fields. How can I get it to return the full document with all the fields?
db.trip.aggregate([
{ $match: { tripId: { $in: ["trip01", "trip02" ]}} },
{ $sort: { acquiredTime: -1} },
{ $group: { _id: "$tripId" , acquiredTime: { $first: "$acquiredTime" }}}
])
The collection looks something like:
[{
"tripId": "trip01",
"acquiredTime": 1000,
"name": "abc",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 1000,
"name": "xyz",
"value": "xyz"
},{
"tripId": "trip01",
"acquiredTime": 2000,
"name": "def",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 2000,
"name": "ghi",
"value": "xyz"
}]
At the moment I get:
[{
"tripId": "trip01",
"acquiredTime": 2000
},{
"tripId": "trip02",
"acquiredTime": 2000
}]
I need to get:
[{
"tripId": "trip01",
"acquiredTime": 2000,
"name": "def",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 2000,
"name": "ghi",
"value": "xyz"
}]
Your approach is the right approach, but the thing is that $group and $project just don't work that way and require you to name all of the fields you want in the result.
If you don't mind the structure looking a bit different, then you can always use $$ROOT in MongoDB versions 2.6 and greater:
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "acquiredTime": -1} },
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}}
])
So the whole document is there, but just all contained as a sub-document to "doc" in the results.
For anything else or prettier you are going to have to specify every field that you want. It's just a data structure so you could always generate it from code anyway.
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "acquiredTime": -1} },
{ "$group": {
"_id": "$tripId" ,
"acquiredTime": { "$first": "$acquiredTime" },
"name": { "$first": "$name" },
"value": { "$first": "$value" }
}}
])
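For illustration, the sort-then-$first logic behind both variants can be sketched in plain JavaScript over the sample collection:

```javascript
// Plain-JS sketch of '$sort then $group with $first: "$$ROOT"',
// using the sample trip documents from the question.
const trips = [
  { tripId: "trip01", acquiredTime: 1000, name: "abc", value: "abc" },
  { tripId: "trip02", acquiredTime: 1000, name: "xyz", value: "xyz" },
  { tripId: "trip01", acquiredTime: 2000, name: "def", value: "abc" },
  { tripId: "trip02", acquiredTime: 2000, name: "ghi", value: "xyz" }
];

// $sort: { acquiredTime: -1 }
const sorted = [...trips].sort((a, b) => b.acquiredTime - a.acquiredTime);

// $group: { _id: "$tripId", doc: { $first: "$$ROOT" } }
// $first keeps the whole first (i.e. latest) document per tripId
const byTrip = {};
for (const t of sorted) if (!(t.tripId in byTrip)) byTrip[t.tripId] = t;
```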
To my understanding, the above solution can suffer from performance and RAM problems when a large number of unique documents has to be returned, because as written the output of $match is sorted in memory rather than via an index.
Reference: https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/
To maximise performance and minimise RAM usage:
Create a unique compound index { tripId: 1, acquiredTime: -1 }
Have the $sort operate exactly along that index
This of course will cost you an index, which will slow down inserts - there's no free lunch :)
Additionally, the cosmetic problem of having the original document moved to a sub-document can be easily solved with $replaceRoot, without needing to explicitly list the document keys.
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "tripId": 1, "acquiredTime": -1 } },
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}},
{ "$replaceRoot": { "newRoot": "$doc"}}
])
Finally, it's worth noting that if acquiredTime is just your server time, you can get rid of it, as the _id already embeds the creation timestamp. So the unique index would go on { tripId: 1, _id: -1 }, and the query becomes:
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "tripId": 1, "_id": -1 } },
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}},
{ "$replaceRoot": { "newRoot": "$doc"}}
])
This is also better because date objects in MongoDB have a resolution of 1 millisecond, which - depending on the frequency of your inserts - may result in race conditions that are extremely hard to reproduce, whereas auto-generated _id values from a single client process are strictly increasing.
I have a document structure like the one below.
{
"_id": ObjectId("53770b9de4b0ba6f4c976a27"),
"source": [{
"value": 5127,
"createdAt": ISODate("2014-05-07T07:11:00Z"),
"generated": ISODate("2014-05-17T07:23:00Z"),
}, {
"value": 5187,
"createdAt": ISODate("2014-05-17T07:39:00Z"),
"generated": ISODate("2014-05-17T07:40:00Z"),
}, {
"value": 5187,
"createdAt": ISODate("2014-05-17T07:39:00Z"),
"generated": ISODate("2014-05-17T07:41:00Z")
}],
}
There is a duplicate in the subdocument array. I need to write a MongoDB query that retrieves all the subdocuments and, where there are duplicates, keeps only the latest one based on the "generated" value, like below.
{
"_id": ObjectId("53770b9de4b0ba6f4c976a27"),
"source": [{
"value": 5127,
"createdAt": ISODate("2014-05-07T07:11:00Z"),
}, {
"value": 5187,
"createdAt": ISODate("2014-05-17T07:39:00Z"),
"generated": ISODate("2014-05-17T07:41:00Z")
}],
}
Is there any way to get data in this shape using a MongoDB query?
You can get this done with the aggregation framework.
db.test.aggregate([
{$unwind: '$source'},
{$group: {_id: {value: "$source.value", createdAt: "$source.createdAt"}, generated: {$max: "$source.generated"}}}
]);
Which gives you the result:
{ "_id" : { "value" : 5187, "createdAt" : ISODate("2014-05-17T07:39:00Z") }, "generated" : ISODate("2014-05-17T07:41:00Z") }
{ "_id" : { "value" : 5127, "createdAt" : ISODate("2014-05-07T07:11:00Z") }, "generated" : ISODate("2014-05-17T07:23:00Z") }
A little bit different from what you want. But if you really want the format above, try this:
db.test.aggregate([
{$unwind: '$source'},
{$group: {_id: {_id: "$_id", value: "$source.value", createdAt: "$source.createdAt"}, generated: {$max: "$source.generated"}}},
{$group: {_id: "$_id._id", source: {$push: {value: "$_id.value", createdAt: "$_id.createdAt", generated: "$generated"}}}}
]);
which gives you:
{
"_id": ObjectId("53770b9de4b0ba6f4c976a27"),
"source": [{
"value": 5187,
"createdAt": ISODate("2014-05-17T07:39:00Z"),
"generated": ISODate("2014-05-17T07:41:00Z")
}, {
"value": 5127,
"createdAt": ISODate("2014-05-07T07:11:00Z"),
"generated": ISODate("2014-05-17T07:23:00Z")
}]
}
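The grouping logic here (key on value plus createdAt, keep the $max of generated) can be sketched in plain JavaScript; the dates are kept as ISO strings, which compare correctly lexicographically:

```javascript
// Plain-JS sketch of the dedup logic: group array entries by
// (value, createdAt) and keep the entry with the latest "generated".
// Dates are ISO-8601 strings, so string comparison orders them correctly.
const source = [
  { value: 5127, createdAt: "2014-05-07T07:11:00Z", generated: "2014-05-17T07:23:00Z" },
  { value: 5187, createdAt: "2014-05-17T07:39:00Z", generated: "2014-05-17T07:40:00Z" },
  { value: 5187, createdAt: "2014-05-17T07:39:00Z", generated: "2014-05-17T07:41:00Z" }
];

const latest = new Map();
for (const s of source) {
  const key = s.value + "|" + s.createdAt;
  const prev = latest.get(key);
  if (!prev || s.generated > prev.generated) latest.set(key, s);  // $max on generated
}
const deduped = [...latest.values()];
```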
There are many documents:
{
"_id" : ObjectId("506ddd1900a47d802702a904"),
"subid" : "s1",
"total" : "300",
"details" :[{
name:"d1", value: "100"
},
{
name:"d2", value: "200"
}]
}
{
"_id" : ObjectId("306fff1900a47d802702567"),
"subid" : "s1",
"total" : "700",
"details" : [{
name:"d1", value: "300"
},
{
name:"d8", value: "400"
}]
}
Elements in 'details' arrays may vary.
Question is: how can I get such result with aggregation framework and java?
{
"_id" : "s1",
"total" : "1000",
"details" : [{
name:"d1", value: "400"
},
{
name:"d2", value: "200"
},
{
name:"d8", value: "400"
}]
}
Or maybe I should use custom map-reduce functions here?
This is very achievable with aggregate, though a little obtuse, but let's run through it:
db.collection.aggregate([
// First Group to get the *master* total for the documents
{"$group": {
"_id": "$subid",
"total": { "$sum": "$total" },
details: { "$push": "$details" }
}},
// Unwind the details
{"$unwind": "$details"},
// Unwind the details "again" since you *pushed* an array onto an array
{"$unwind":"$details"},
// Now sum up the values by each name (keeping levels)
{"$group": {
"_id": {
"_id": "$_id",
"total": "$total",
"name": "$details.name"
},
"value": {"$sum": "$details.value"}
}},
// Sort the names (because you expect that!)
{"$sort": { "_id.name": 1}},
// Do some initial re-shaping for convenience
{"$project": {
"_id": "$_id._id",
"total": "$_id.total",
"details": { "name": "$_id.name", "value": "$value" }
}},
// Now push everything back into an array form
{"$group": {
"_id": {
"_id": "$_id",
"total": "$total"
},
"details": {"$push": "$details"}
}},
// And finally project nicely
{"$project": {
"_id": "$_id._id",
"total": "$_id.total",
"details": 1
}}
])
So if you gave that a try before, you might have missed the concept of doing the initial group to get the top level sum on your total field in your documents.
Admittedly, the tricky bit is "getting your head around" the whole double unwind thing that comes next. Since in that first group we pushed an array into another array, then we now end up with this new nested structure that you need to unwind twice in order to come to a "de-normalized" form.
Once you've done that, you just $group up to the name field:
equiv ( GROUP BY _id, total, "details.name" )
So more or less like that with some sensible re-shaping. Then I ask to sort by the name key (because you printed it that way), and finally we $project into the actual form that you wanted.
So Bingo, we have your result. Thanks for the cool question to show the use of a double unwind.
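One caveat: $sum ignores non-numeric values, so the pipeline assumes total and value are stored as numbers, while the sample documents store them as strings (they would need converting first). With numeric fields, the grouping logic looks like this in plain JavaScript:

```javascript
// Plain-JS sketch of the whole pipeline: sum totals per subid and
// sum detail values per name. Sample docs from the question, with
// totals/values as numbers since $sum needs numeric fields.
const docs = [
  { subid: "s1", total: 300, details: [{ name: "d1", value: 100 }, { name: "d2", value: 200 }] },
  { subid: "s1", total: 700, details: [{ name: "d1", value: 300 }, { name: "d8", value: 400 }] }
];

const out = {};
for (const d of docs) {
  const g = out[d.subid] ?? (out[d.subid] = { _id: d.subid, total: 0, byName: {} });
  g.total += d.total;                                             // first $group: master total
  for (const det of d.details)                                    // the double $unwind
    g.byName[det.name] = (g.byName[det.name] ?? 0) + det.value;   // second $group by name
}

// final $sort + $project: sorted names, pushed back into an array
const result = Object.values(out).map(g => ({
  _id: g._id,
  total: g.total,
  details: Object.keys(g.byName).sort().map(n => ({ name: n, value: g.byName[n] }))
}));
```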