Suppose I have an array of objects as below.
"array" : [
{
"id" : 1
},
{
"id" : 2
},
{
"id" : 2
},
{
"id" : 4
}
]
If I want to retrieve multiple objects ({id : 2}) from this array, the aggregation query looks like this.
db.coll.aggregate([{ $match : {"_id" : ObjectId("5492690f72ae469b0e37b61c")}}, { $unwind : "$array"}, { $match : { "array.id" : 2}}, { $group : { _id : "$_id", array : { $push : { id : "$array.id"}}}} ])
The output of the above aggregation is
{
"_id" : ObjectId("5492690f72ae469b0e37b61c"),
"array" : [
{
"id" : 2
},
{
"id" : 2
}
]
}
Now the questions are:
1) Is retrieving multiple objects from an array possible using find() in MongoDB?
2) With respect to performance, is aggregation the correct way to do this, given that we need four pipeline operators?
3) Can we do this with Java manipulation (looping over the array and keeping only the {id : 2} objects) after a
find({"_id" : ObjectId("5492690f72ae469b0e37b61c")}) query? find retrieves the document once and keeps it in RAM, but with aggregation four operations need to be performed in RAM to get the output.
The reason I ask question 3) is this: if thousands of clients access the database at the same time, RAM will be overloaded. If the filtering is done in Java, there is less work for the RAM.
4) For how long will the working set stay in RAM?
Is my understanding correct?
Please correct me if I am wrong.
Please help me get the right insight on this.
1) No. You project the first matching one with $, you project all of them, or you project none of them.
2) No-ish. If you have to work with this array, aggregation is what will allow you to extract multiple matching elements, but the correct solution, conceptually and for performance, is to design your document structure so this problem does not arise, or arises only for rare queries whose performance is not particularly important.
3) Yes.
4) We have no information that would allow us to give a reasonable answer to this question. This is also out of scope relative to the rest of the question and should be a separate question.
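For point 3, the client-side filtering can be sketched roughly as follows. The question mentions Java; this sketch uses JavaScript to match the shell snippets in this thread. The document literal and the helper name keepMatching are made up for illustration.

```javascript
// Hypothetical document, shaped like what find() would return for the question's _id.
const doc = {
  _id: "5492690f72ae469b0e37b61c",
  array: [{ id: 1 }, { id: 2 }, { id: 2 }, { id: 4 }]
};

// Return a copy of the document whose array keeps only the elements with the given id.
// The original document is left unchanged.
function keepMatching(document, id) {
  return { ...document, array: document.array.filter((el) => el.id === id) };
}

const filtered = keepMatching(doc, 2);
console.log(JSON.stringify(filtered.array)); // [{"id":2},{"id":2}]
```

The trade-off is that the whole array travels over the network before it is filtered, so this mainly pays off when the arrays are small.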
I recently started moving data from Microsoft SQL Server to MongoDB to gain scalability. The migration itself went well.
The document has 2 important fields: customer, timestamphash (year month day).
We imported only 75 million documents into MongoDB, which we installed on an Azure Linux VM.
After adding a compound index on both fields, we have the following problem:
On 3 million documents (after filtering), an aggregate that groups and counts by customerId takes 24 seconds to finish. SQL Server returns the result in less than 1 second on the same data.
Do you think Cassandra would be a better solution? We need good query performance on large amounts of data.
I tried disk write, giving the VM more RAM. Nothing works.
Query:
aggregate([
{ "$match" : { "Customer" : 2 } },
{ "$match" : { "TimestampHash" : { "$gte" : 20160710 } } },
{ "$match" : { "TimestampHash" : { "$lte" : 20190909 } } },
{ "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
{ "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
{ "$skip" : 0 },
{ "$limit" : 10 }])
Update:
I used 'allowDiskUse: true' and the problem was solved: the query time dropped to 4 seconds for the 3M filtered documents.
I have encountered a similar problem before, in this question, and to be honest, I guess Cassandra is better in your particular case, but the question was about Mongo aggregation query optimization, right?
For now, one of my collections has more than 3M docs, and aggregation queries shouldn't take 24s if you build the indexes correctly.
First of all, check the index usage via MongoDB Compass. Is Mongo actually using it? If your app sends a lot of queries to the DB and your index shows 0 usage, then, as you already guessed, something is wrong with your index.
Second, use the explain method (this doc will help you out) to get more information about your query.
Third: the order of the fields in an index matters. For example, if you have a $match stage with 3 fields and you request docs by those fields:
{ $match: {a_field:a, b_field:b, c_field:c} }
then you should build a compound index on the a, b, c fields in exactly the same order.
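As a sketch of that advice applied to the query from the question: the three $match stages can be written as one, and a compound index can list the equality field before the range field. The variable names are illustrative, and the collection name in the comments is hypothetical because the question does not give one.

```javascript
// Index specification: equality field (Customer) first, range field (TimestampHash) second.
const indexSpec = { Customer: 1, TimestampHash: 1 };

// Same filter as the three $match stages in the question, merged into a single stage.
// The no-op { $skip: 0 } stage is omitted.
const reportPipeline = [
  { $match: { Customer: 2, TimestampHash: { $gte: 20160710, $lte: 20190909 } } },
  { $group: { _id: { Device: "$Device" }, __agg0: { $sum: 1 } } },
  { $project: { Device: "$_id.Device", Count: "$__agg0", _id: 0 } },
  { $limit: 10 }
];

// In the mongo shell these would be used roughly like this
// ("events" is a hypothetical collection name):
//   db.events.createIndex(indexSpec)
//   db.events.aggregate(reportPipeline, { allowDiskUse: true })
//   db.events.explain("executionStats").aggregate(reportPipeline)
console.log(JSON.stringify(reportPipeline[0]));
```

The explain output is where to check whether the $match stage really uses the index rather than scanning the collection.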
There is always some kind of DB architecture problem. I highly recommend not stockpiling all your data inside one collection. Use {timestamps: true} in your schema (on insertion it creates two fields, createdAt and updatedAt):
{
timestamps: true
}
Then store old or outdated data in a different collection, and use the $lookup aggregation stage when you really need to operate on it.
Hope you'll find something useful in my answer.
Field is added but then disappears.
Here is the code from within the mongo shell:
> db.users.aggregate([{$addFields:{totalAge:{$sum:"$age"}}}])
{ "_id" : ObjectId("5acb81b53306361018814849"), "name" : "A", "age" : 1, "totalAge" : 1 }
{ "_id" : ObjectId("5acb81b5330636101881484a"), "name" : "B", "age" : 2, "totalAge" : 2 }
{ "_id" : ObjectId("5acb81b5330636101881484b"), "name" : "C", "age" : 3, "totalAge" : 3 }
> db.users.find().pretty()
{ "_id" : ObjectId("5acb81b53306361018814849"), "name" : "A", "age" : 1 }
{ "_id" : ObjectId("5acb81b5330636101881484a"), "name" : "B", "age" : 2 }
{ "_id" : ObjectId("5acb81b5330636101881484b"), "name" : "C", "age" : 3 }
Aggregation only reads data from your collection; it does not modify the collection. The best way to think about aggregation is that you read some data and manipulate it for your immediate use.
If you want to change it in the main source, then you must use the update method.
Or an easier way (not the best, but easy):
db.users.aggregate([{$addFields:{totalAge:{$sum:"$age"}}}]).forEach(function (x){
db.users.save(x)
})
Nozar's answer was correct, but .save() is now deprecated.
Instead of using his/her exact answer, modify it by using .updateOne and $set.
Old/deprecated answer:
db.users
.aggregate([{$addFields:{totalAge:{$sum:"$age"}}}])
.forEach(function (x){db.users.save(x)})
New/working answer:
db.users
.aggregate([{$addFields:{totalAge:{$sum:"$age"}}}])
.forEach(function (x){db.users.updateOne({name: x.name}, {$set: {totalAge: x.totalAge}})})
Note: In my example, I'm using 'name' to filter (essentially match, in this case) the documents in 'users' collection, but you can use any unique field (e.g. _id field).
Shoutout to Nozar for leading me to the updated answer I provided, as I just used it in my project! (In my project, I used a $match pipeline stage prior to the $addFields pipeline stage, as I was just looking to do this to a single document in my collection, rather than all documents)
The reason is that your two approaches are totally different. In the aggregation you use $addFields, so that query returns a totalAge. Your find query, however, only returns the data that is actually stored in the database, and totalAge is never calculated there.
I hope you can understand it.
The aggregation pipeline doesn't alter the original data; what it does is to take a temporary in-memory copy of the data and perform a sequence of manipulations on it (still in-memory) and send it to the client.
It's similar to the way you can do db.collection.find().sort() ; the sorting there only changes what is being returned to the client, it doesn't change what is stored in the database.
The only exception is when you use the $out stage, which saves the result of the aggregation to another collection. The fact that a special type of stage had to be added for this shows that a normal aggregation does not write back to the stored data.
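A rough in-memory analogy, not a description of how the server is implemented: computing a derived field produces new result objects and leaves the source untouched. The users array below stands in for the collection from the question.

```javascript
// Stand-in for the stored documents in the users collection.
const users = [
  { name: "A", age: 1 },
  { name: "B", age: 2 },
  { name: "C", age: 3 }
];

// Analogue of {$addFields: {totalAge: {$sum: "$age"}}}:
// build new objects for the result, do not touch the source objects.
const aggregated = users.map((u) => ({ ...u, totalAge: u.age }));

console.log("totalAge" in aggregated[0]); // true
console.log("totalAge" in users[0]);      // false
```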
I have a test object which works as nodes on a tree, containing 0 or more children instances of the same type. I'm persisting it on MongoDB and querying it with Morphia.
I perform the following query:
db.TestObject.find( {}, { _id: 1, childrenTestObjects: 1 } ).limit(6).sort( {_id: 1 } ).pretty();
Which results in:
{ "_id" : NumberLong(1) }
{ "_id" : NumberLong(2) }
{ "_id" : NumberLong(3) }
{ "_id" : NumberLong(4) }
{
"_id" : NumberLong(5),
"childrenTestObjects" : [
{
"stringValue" : "6eb887126d24e8f1cd8ad5033482c781",
"creationDate" : ISODate("1997-05-24T00:00:00Z"),
"childrenTestObjects" : [
{
"stringValue" : "2ab8f86410b4f3bdcc747699295eb5a4",
"creationDate" : ISODate("2024-10-10T00:00:00Z"),
"_id" : NumberLong(7)
}
],
"_id" : NumberLong(6)
}
]
}
That's awesome, but also a little surprising. I'm having two issues with the results:
1) When I do a projection, it only applies to the top elements. The children elements still return other properties not in the projection (stringValue and creationDate). I'd like the field selection to apply to all documents and sub documents of the same type. This tree has an undetermined number of sub items, so I can't specify that in the query explicitly. How can I accomplish that?
2) To my surprise, limit applied to sub documents! You see that there was one embedded document with id 6. I was expecting to see 6 top level documents with N sub documents, but instead got just 5. How can I tell MongoDB to return 6 top level elements, regardless of what is embedded in them? Without that, having a consistent pagination system is impossible.
All your help has made learning MongoDB way faster and I really appreciate it! Thanks!
As for 1), projections retain fields in the results. In this case that field is childrenTestObjects which happens to be a document. So mongo returns that entire field which is, of course, the entire subdocument. Projections are not recursive so you'd have to specify each field explicitly.
As for 2), that doesn't sound right. It would help to see the query results without the projection (the full documents), and we can take it from there.
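For 1), since the projection cannot recurse, one option is to prune the tree on the client after the query returns. This is only a sketch: pruneTree is a made-up helper, and the sample document is modeled on the one with _id 5 in the question, with plain numbers in place of NumberLong.

```javascript
// Keep only _id and childrenTestObjects at every level of the tree.
function pruneTree(node) {
  const pruned = { _id: node._id };
  if (Array.isArray(node.childrenTestObjects)) {
    pruned.childrenTestObjects = node.childrenTestObjects.map((child) => pruneTree(child));
  }
  return pruned;
}

// Hypothetical document with two levels of embedded children.
const tree = {
  _id: 5,
  childrenTestObjects: [
    {
      _id: 6,
      stringValue: "6eb887126d24e8f1cd8ad5033482c781",
      childrenTestObjects: [{ _id: 7, stringValue: "2ab8f86410b4f3bdcc747699295eb5a4" }]
    }
  ]
};

console.log(JSON.stringify(pruneTree(tree)));
// {"_id":5,"childrenTestObjects":[{"_id":6,"childrenTestObjects":[{"_id":7}]}]}
```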
I'm quite new to MongoDb and I'm trying to get my head around the grouping/counting functions. I want to retrieve a list of votes per track ID, ordered by their votes. Can you help me make sense of this?
List schema:
{
"title" : "Bit of everything",
"tracks" : [
ObjectId("5310af6518668c271d52aa8d"),
ObjectId("53122ffdc974dd3c48c4b74e"),
ObjectId("53123045c974dd3c48c4b74f"),
ObjectId("5312309cc974dd3c48c4b750")
],
"votes" : [
{
"track_id" : "5310af6518668c271d52aa8d",
"user_id" : "5312551c92d49d6119481c88"
},
{
"track_id" : "53122ffdc974dd3c48c4b74e",
"user_id" : "5310f488000e4aa02abcec8e"
},
{
"track_id" : "53123045c974dd3c48c4b74f",
"user_id" : "5312551c92d49d6119481c88"
}
]
}
I'm looking to generate the following result (ideally ordered by the number of votes, also ideally including those with no entries in the votes array, using tracks as a reference.):
[
{
track_id: 5310af6518668c271d52aa8d,
track_votes: 1
},
{
track_id: 5312309cc974dd3c48c4b750,
track_votes: 0
}
...
]
In MySQL, I would execute the following
SELECT COUNT(*) AS track_votes, track_id FROM list_votes GROUP BY track_id ORDER BY track_votes DESC
I've been looking into the documentation for the various aggregation/reduce functions, but I just can't seem to get anything working for my particular case.
Can anyone help me get started with this query? Once I know how these are structured I'll be able to apply that knowledge elsewhere!
I'm using mongodb version 2.0.4 on Ubuntu 12, so I don't think I have access to the aggregation functions provided in later releases. Would be willing to upgrade, if that's the general consensus.
Many thanks in advance!
I recommend upgrading your MongoDB version to 2.2 and using the Aggregation Framework as follows:
db.collection.aggregate(
{ $unwind:"$votes"},
{ $group : { _id : "$votes.track_id", number : { $sum : 1 } } }
)
MongoDB 2.2 introduced a new aggregation framework, modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
The output will look like this:
{
"result":[
{
"_id" : "5310af6518668c271d52aa8d", <- ObjectId
"number" : 2
},
...
],
"ok" : 1
}
If upgrading is not possible, I recommend doing it programmatically.
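Doing it programmatically could look roughly like this. It is a sketch under assumptions: the list document is hypothetical, its ids are shortened plain strings, and countVotes is a made-up helper. With real data, the ObjectId values in tracks would first need converting to the same string form that votes.track_id uses.

```javascript
// Hypothetical list document; short strings stand in for the real ids.
const list = {
  tracks: ["aa8d", "b74e", "b74f", "b750"],
  votes: [
    { track_id: "aa8d", user_id: "u1" },
    { track_id: "b74e", user_id: "u2" },
    { track_id: "aa8d", user_id: "u2" }
  ]
};

function countVotes(doc) {
  // Start every track at 0 so tracks without votes still appear in the result.
  const counts = new Map(doc.tracks.map((id) => [String(id), 0]));
  for (const vote of doc.votes) {
    const id = String(vote.track_id);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  // Convert to the requested shape and order by number of votes, highest first.
  return [...counts]
    .map(([track_id, track_votes]) => ({ track_id, track_votes }))
    .sort((a, b) => b.track_votes - a.track_votes);
}

const ranked = countVotes(list);
console.log(ranked[0]); // { track_id: 'aa8d', track_votes: 2 }
```

Seeding the map from tracks is what covers the "include those with no entries in the votes array" requirement, which the $unwind/$group pipeline alone does not.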
I hate this kind of question, but maybe you can point me to the obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents, which has a string field called username on which I have an index. The index was non-unique, but I recently made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate(
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} } );
which returns
{
"result" : [
{
"_id" : "davidbeges",
"total" : 2
},
{
"_id" : "jesusantonio",
"total" : 2
},
{
"_id" : "elesitasweet",
"total" : 2
},
{
"_id" : "theschoolofbmx",
"total" : 2
},
{
"_id" : "longflight",
"total" : 2
},
{
"_id" : "thenotoriouscma",
"total" : 2
}
],
"ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match $sort $limit $skip
And they will work if placed before:
$project $unwind $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.
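Put together with the suggestion from the JIRA response, the duplicate check could be written roughly as below. The stages are taken from the question plus the leading $sort. Passing them as one array is an assumption about newer shells; the 2.2 shell from the question takes the stages as separate arguments, as in the original query.

```javascript
// Duplicate check with an index-friendly $sort in front of the $group.
const dedupCheckPipeline = [
  { $sort: { username: 1 } },                           // can use the username index
  { $group: { _id: "$username", total: { $sum: 1 } } },
  { $match: { total: { $gte: 2 } } },
  { $sort: { total: -1 } }
];

// mongo shell (newer versions): db.users.aggregate(dedupCheckPipeline)
console.log(dedupCheckPipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
// $sort -> $group -> $match -> $sort
```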