MongoDB Event Driven Database Design

Goal
Zero-conflict system: making this an insert-only (append-only) system would save us from conflicts. People create and update documents both offline and online, so being able to figure out which update trumps which is important.
Deep historical reference: I want to know what a document looked like at any point in time. On top of that, I need a deep historical analysis of how each item changes over time.
I was thinking of the following architecture:
Reference Document
{
  _id: "u12345",
  type: "user",
  createdAt: 1584450565 // UNIX timestamp
}
{
  _id: "<random>",
  type: "user-name-revision", // {type}-{key}-revision
  referenceId: "u12345",
  value: "John Doe Boy",
  updatedAt: 1584450565
}
{
  _id: "<random>",
  type: "user-name-revision",
  referenceId: "u12345",
  value: "John Doe",
  updatedAt: 1584450566 // 1 second later than the above
}
{
  _id: "<random>",
  type: "user-email-revision",
  referenceId: "u12345",
  value: "john#gmail.com",
  updatedAt: 1584450565
}
If you want to get the user, you would:
Get all documents with referenceId of u12345.
Keep only the most recent document of each type.
Then combine them and output the user like so:
{
  _id: "u12345",
  type: "user",
  createdAt: 1584450565,
  name: "John Doe",
  email: "john#gmail.com",
  updatedAt: 1584450566 // highest timestamp
}
The only issue I see is if I wanted to sort all users by, say, name: with 1000 users, I don't see a clean way of doing this.
I was wondering if anyone had any suggestions for a pattern I could use. I'm using MongoDB so I have the power of that at my disposal.

You can try the aggregation below.
Derive the key field from the type field, sort by updatedAt, and group to pick the latest value while keeping the referenceId and updatedAt.
Group all documents per referenceId, merge the different key values, keep the updatedAt, and post-process to format the document.
Use $lookup to pull in the user document, followed by $replaceRoot to merge the main document with the looked-up document.
Sort the documents by name.
db.collectionname.aggregate([
  {"$addFields":{"key":{"$arrayElemAt":[{"$split":["$type","-"]},1]}}},
  {"$sort":{"updatedAt":-1}},
  {"$group":{
    "_id":{"referenceId":"$referenceId","key":"$key"},
    "value":{"$first":"$value"},
    "referenceId":{"$first":"$referenceId"},
    "updatedAt":{"$first":"$updatedAt"}
  }},
  {"$sort":{"updatedAt":-1}},
  {"$group":{
    "_id":"$_id.referenceId",
    "data":{
      "$mergeObjects":{"$arrayToObject":[[["$_id.key","$value"]]]}
    },
    "updatedAt":{"$first":"$updatedAt"}
  }},
  {"$addFields":{
    "data.referenceId":"$_id",
    "data.updatedAt":"$updatedAt"
  }},
  {"$project":{"data":1}},
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"data.referenceId",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$replaceRoot":{
    "newRoot":{
      "$mergeObjects":[{"$arrayElemAt":["$reference",0]},"$data"]}
  }},
  {"$project":{"_id":0}},
  {"$sort":{"name":1}}
])
Alternate approach:
With all these transformations, your query will be a little slower. You can make a few tweaks.
Input
{
  _id: "<random>",
  type: "user",
  key: "name",
  referenceId: "u12345",
  value: "John Doe Boy",
  updatedAt: 1584450565
}
Query
db.collectionname.aggregate([
  {"$sort":{"updatedAt":-1}},
  {"$group":{
    "_id":{"referenceId":"$referenceId","key":"$key"},
    "top":{"$first":"$$ROOT"}
  }},
  {"$sort":{"top.updatedAt":-1}},
  {"$group":{
    "_id":"$_id.referenceId",
    "max":{"$max":{"$cond":[{"$eq":["$_id.key","name"]},"$top.value",null]}},
    "key-values":{"$push":{"k":"$_id.key","v":"$top.value"}},
    "updatedAt":{"$first":"$top.updatedAt"}
  }},
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"_id",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$project":{"_id":0}},
  {"$sort":{"max":1}}
])
We can refine our schema further to remove a few other stages, provided we make sure the latest value is always appended at the end of the updates array. Something like:
Input
{
  _id: "<random>",
  type: "user",
  key: "name",
  referenceId: "u12345",
  updates: [
    {"value": "John Doe Boy", updatedAt: 1584450565},
    {"value": "John Doe", updatedAt: 1584450566}
  ]
}
Query
db.collectionname.aggregate([
  {"$addFields":{"latest":{"$arrayElemAt":["$updates",-1]}}},
  {"$group":{
    "_id":"$referenceId",
    "max":{"$max":{"$cond":[{"$eq":["$key","name"]},"$latest.value",null]}},
    "key-values":{"$push":{"k":"$key","v":"$latest.value"}},
    "updatedAt":{"$first":"$latest.updatedAt"}
  }},
  {"$lookup":{
    "from":"othercollectionname",
    "localField":"_id",
    "foreignField":"_id",
    "as":"reference"
  }},
  {"$project":{"_id":0}},
  {"$sort":{"max":1}}
])

Your question does not have enough requirements for a specific answer, so I'll try to give an answer that should cover many cases.
I doubt you'll find detailed published use cases, however, I can give you a few tips from my personal experience.
High throughput:
If you are using high-throughput event streaming, it would be better to store your data in an event log, where IDs are not unique and there are no updates, only inserts. This could be done, for instance, with Kafka, which is designed for event streaming. You could then process the events in bulk into a searchable database, e.g. MongoDB.
Low throughput:
For a lower throughput, you could insert documents directly into MongoDB, however, still only insert, not update data.
Storing data in an event-log style in MongoDB:
In both cases, within MongoDB, you'll want a random _id (e.g. a UUID), so each event has a unique _id. To access a logical document, you'll need another field, e.g. docId, which along with eventTimestamp will be indexed (with eventTimestamp descending for faster access to the latest version).
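A minimal sketch of this layout in the mongo shell; the collection name events and the key/value fields are assumptions for illustration, while docId and eventTimestamp follow the naming above:
// Every write is an insert; each event gets its own random UUID _id.
db.events.insertOne({
  _id: UUID(),
  docId: "u12345",                 // identifies the logical document
  key: "name",
  value: "John Doe",
  eventTimestamp: new Date()
})
// Compound index so the latest version of a logical document is cheap to read.
db.events.createIndex({ docId: 1, eventTimestamp: -1 })
// Latest event for a given logical document:
db.events.find({ docId: "u12345" }).sort({ eventTimestamp: -1 }).limit(1)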
Searching:
To search by other fields, you can use additional indexes as necessary. However, if your searches take significant CPU time, make sure you only run them against secondary instances of MongoDB (a secondary read preference), so that the event inserts won't get delayed. Make yourself familiar with MongoDB's aggregation pipeline.
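For example, a heavier search could be pointed at secondaries from the shell like this; the index and pipeline are illustrative, reusing the assumed events collection from above:
// Route this shell session's reads to secondaries so inserts on the primary are not delayed.
db.getMongo().setReadPref("secondaryPreferred")
// An additional index to support searching by a value, plus a simple aggregation against it.
db.events.createIndex({ key: 1, value: 1 })
db.events.aggregate([
  { $match: { key: "name", value: /^John/ } },
  { $sort: { eventTimestamp: -1 } },
  { $limit: 10 }
])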
To prevent invalid states due to out-of-order updates:
Since you want to enable updates, you should consider only saving the changes in each document, e.g. +1 to field A, set value x for field B. In this case you will need an index with docId and ascending eventTimestamp instead, and every now and then aggregate your events into summary documents in a different collection, to enable faster reading of the latest state. Use the eventTimestamp of the latest document per docId for the aggregated document, plus an aggregationTimestamp and a versionCount. If at any point you receive a document with an eventTimestamp lower than the latest eventTimestamp in the aggregated collection, you'll need to partially recalculate that collection. In other cases, you can update the aggregated collection incrementally.
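One way to materialize those summary documents is an aggregation that writes into a separate collection with $merge (MongoDB 4.2+). The collection name eventSummaries is an assumption, and a real delta aggregation (summing "+1" changes, etc.) would replace the latestEvent accumulator with the appropriate ones:
// Roll the event log up into one summary document per docId.
db.events.aggregate([
  { $sort: { docId: 1, eventTimestamp: 1 } },
  { $group: {
      _id: "$docId",
      eventTimestamp: { $last: "$eventTimestamp" },  // latest event per logical document
      versionCount: { $sum: 1 },
      latestEvent: { $last: "$$ROOT" }               // placeholder for real delta aggregation
  }},
  { $addFields: { aggregationTimestamp: "$$NOW" } },
  { $merge: { into: "eventSummaries", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
])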

Use this and you will get the desired output; make sure you have an index on referenceId and updatedAt, and enough memory to sort.
db.columnName.aggregate([
{
$match:{
referenceId:"u12345"
}
},
{
$project:{
type: { $arrayElemAt: [ {$split: [ "$type", "-" ]}, 0 ] },
referenceId:true,
createdAt:true,
name:true,
email:true,
updatedAt:true
}
},
{
$sort:{
updatedAt:-1
}
},
{
$group:{
_id:"$referenceId",
type:{
$first:"$type"
},
createdAt:{
$last:"$updatedAt"
},
name:{
$first:"$name"
},
email:{
$first:"$email"
},
updatedAt:{
$first:"$updatedAt"
}
}
}
])

Related

Can you have a collection that's randomly distributed in mongodb?

I have a collection that's essentially just a collection of unique IDs, and I want to store them randomly distributed so I can quickly just findOne instead of using $sample, since that's quicker than an aggregation.
I ran the following aggregation to sort it randomly:
db.my_coll.aggregate([{"$sample": {"size": 1200000}}, {"$out": {db: "db", coll: "my_coll"}}], {allowDiskUse: true})
it seems to work?
db.my_coll.find():
{ _id: 581848, schema_version: 1 },
{ _id: 1184557, schema_version: 1 },
{ _id: 213688, schema_version: 1 },
....
Is this allowed? I thought _id is a default index, and it should always be sorted by the index. I'm only ever removing elements from this collection, so it's fine if they don't get inserted randomly, but I don't know if this is just a hack that might at some point behave differently.

Speed up aggregation on large collection

I currently have a database with about 270 000 000 documents. They look like this:
[{
'location': 'Berlin',
'product': 4531,
'createdAt': ISODate(...),
'value': 3523,
'minOffer': 3215,
'quantity': 7812
},{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}]
The database currently holds a bit over one month of data and has ~170 locations (in EU and US) with ~8000 products. These documents represent timesteps, so there are about ~12-16 entries per day, per product per location (at most 1 per hour though).
My goal is to retrieve all timesteps of a product in a given location for the last 7 days. For a single location this query works reasonable fast (150ms) with the index { product: 1, location: 1, createdAt: -1 }.
However, I also need these timesteps not just for a single location, but an entire region (so about 85 locations). I'm currently doing that with this aggregation, which groups all the entries per hour and averages the desired values:
this.db.collection('...').aggregate([
{ $match: { location: { $in: [array of ~85 locations] }, product: productId, createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) } } }, {
$group: {
_id: {
$toDate: {
$concat: [
{ $toString: { $year: '$createdAt' } },
'-',
{ $toString: { $month: '$createdAt' } },
'-',
{ $toString: { $dayOfMonth: '$createdAt' } },
' ',
{ $toString: { $hour: '$createdAt' } },
':00'
]
}
},
value: { $avg: '$value' },
minOffer: { $avg: '$minOffer' },
quantity: { $avg: '$quantity' }
}
}
]).sort({ _id: 1 }).toArray()
However, this is really really slow, even with the index { product: 1, createdAt: -1, location: 1 } (~40 secs). Is there any way to speed up this aggregation so it goes down to a few seconds at most? Is this even possible, or should I think about using something else?
I've thought about saving these aggregations in another database and just retrieving that and aggregating the rest, this is however really awkward for the first users on the site who have to sit 40 secs through waiting.
These are some ideas which can benefit querying and performance. Whether they will all work together is a matter of trials and testing. Also, note that changing the way data is stored and adding new indexes means there will be changes to the application, i.e., to how data is captured, and the other queries on the same data need to be carefully verified (that they are not affected in a wrong way).
(A) Storing a Day's Details in a Document:
Store (embed) a day's data within the same document as an array of sub-documents. Each sub-document represents an hour's entry.
From:
{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}
to:
{
location: 'London',
product: 1231,
createdAt: ISODate(...),
details: [ { value: 53523, minOffer: 44215, quantity: 2812 }, ... ]
}
This means about ten entries per document. Adding data for an entry will be pushing data into the details array, instead of adding a document as in present application. In case the hour's info (time) is required it can also be stored as part of the details sub-document; it will entirely depend upon your application needs.
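Capturing an hour's entry then becomes a single upsert that pushes into details. A minimal sketch, assuming the collection is called timesteps and the day bucket is keyed by a createdAt truncated to the day:
// Append one hour's entry to the day's document, creating the day bucket if needed.
db.timesteps.updateOne(
  { location: 'London', product: 1231, createdAt: ISODate('2020-03-17T00:00:00Z') },  // day bucket
  { $push: { details: { value: 53523, minOffer: 44215, quantity: 2812 } } },
  { upsert: true }
)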
The benefits of this design:
The number of documents to maintain and query will shrink (roughly ten or more documents per product per location per day collapse into one).
In the query, the group stage will go away; it becomes just a project stage. Note that $project supports the $avg and $sum accumulator expressions over arrays.
The following stage will create the sums and averages for the day (or a document).
{
  $project: { value: { $avg: '$details.value' }, minOffer: { $avg: '$details.minOffer' }, quantity: { $avg: '$details.quantity' } }
}
Note that the increase in document size is not much, given the amount of detail stored per day.
(B) Querying by Region:
At present, multiple locations (a region) are matched with this query filter: { location: { $in: [array of ~85 locations] } }. This filter effectively says: location: location-1, or location: location-2, or ..., location: location-85. Adding a new field, region, lets the filter match a single value instead.
The query by region will change to:
{
$match: {
region: regionId,
product: productId,
createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
}
}
The regionId variable is to be supplied to match with the region field.
Note that, both the queries, "by location" and "by region", will benefit with the above two considerations, A and B.
(C) Indexing Considerations:
The present index: { product: 1, location: 1, createdAt: -1 }.
Taking the new region field into consideration, new indexing will be needed. The query with region cannot benefit without an index on the region field, so a second, compound index suited to the query will be needed. Creating an index on the region field means additional overhead on write operations, and there will also be memory and storage considerations.
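A sketch of the indexes, assuming the collection is called timesteps (the exact key order should be confirmed with explain against your data):
// Existing index supporting the "by location" query.
db.timesteps.createIndex({ product: 1, location: 1, createdAt: -1 })
// Additional compound index supporting the "by region" query.
db.timesteps.createIndex({ region: 1, product: 1, createdAt: -1 })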
NOTES:
After adding the index, verify with explain that both queries ("by location" and "by region") are using their respective indexes. This will require some testing; a trial-and-error process.
Again, adding new fields, storing data in a different format, and adding new indexes require considering the following:
Careful testing and verifying that the other existing queries perform as usual.
The change in data capture needs.
Testing the new queries and verifying if the new design performs as expected.
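The verification with explain mentioned above could look like this (the timesteps collection name and the placeholder values are assumptions):
// Placeholder values for illustration; substitute your real region, product and window.
var regionId = "EU-WEST", productId = 1231, sevenDaysAgo = 7 * 24 * 60 * 60 * 1000;
// Check which index the "by region" aggregation uses and how many documents it examines.
db.timesteps.explain("executionStats").aggregate([
  { $match: { region: regionId, product: productId, createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) } } },
  { $group: { _id: null, value: { $avg: "$value" } } }
])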
Honestly your aggregation is pretty much as optimized as it can get, especially if you have { product: 1, createdAt: -1, location: 1 } as an index like you stated.
I'm not exactly sure how your entire product is built, however the best solution in my opinion is to have another collection containing just the "relevant" documents from the past week.
Then you could query that collection with ease. This is quite easy to maintain in Mongo as well, using a TTL index.
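A sketch of that approach, with an assumed side collection recentTimesteps whose documents expire automatically via a TTL index:
// Documents older than 7 days are removed automatically by MongoDB's TTL monitor.
db.recentTimesteps.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 7 })
// Write each new timestep to both collections, then run the region aggregation against the small one.
db.recentTimesteps.insertOne({ location: 'Berlin', product: 4531, createdAt: new Date(), value: 3523, minOffer: 3215, quantity: 7812 })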
If this is not an option, you could add a temporary field to the "relevant" documents and query on that, making it somewhat faster to retrieve them; but maintaining this field will require a process running every X amount of time, which could make your results not 100% accurate depending on when you decide to run it.

Best way to doing this in mongodb?

db.post
{user:'50001', content:...,},
{user:'50002', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
{user:'50002', content:...,},
db.catagory
{user:'50001', type:'public',...}
{user:'50002', type:'vip',...}
{user:'50003', type:'public',...}
What I want is to pick up the posts of users whose type is 'public'.
I can do it like:
publicUsers = db.catagory.distinct(user, {type:'public'})
db.post.find({user: {$in: publicUsers }}).sort({_id:-1})
or use lookup in aggregate.
and output
{user:'50001', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
Is there some other, better way to do this faster?
Considering a high-request situation, should I create a new collection like post_public just for this find?
The aggregation pipeline is your best bet. It's quite performant, as it's intended for just these types of situations.
Another Edit
Ok, this is the best solution I have, but it has some limitations set by Mongo.
Firstly, a sort operation on non-indexed fields cannot use more than 100MB of RAM. That's quite a bit of space for text based posts, but something to be aware of (there's a workaround that uses disk space if needed, but it will slow down the operation).
Second, and more relevant in this case, Mongo sets a hard cap on the size of a single document to 16MB. Given that you wanted the posts in a single array, a large dataset can potentially pass that if no limits on the output are imposed. If you decide you don't want all posts in a single array, by removing the last three stages of the pipeline ($group, $addFields, and $project), the result will be the sorted posts each within their own document. You will need to limit the number of documents manually to get the n number of posts you want if you go that route. IMO, you should do this. Those last three stages add additional processing burden but offer very little practical benefit on the output. The only difference is iterating over a nested array, vs iterating over the root array.
db.categories.aggregate([
{$project: {
user: 1,
type: 1
}},
{$match: {
type: "public"
}},
{$lookup: {
from: 'posts',
localField: 'user',
foreignField: 'user',
as: 'userPosts'
}},
{$unwind: {
path: "$userPosts",
preserveNullAndEmptyArrays: false
}},
{$sort: {
"userPosts.posted": -1,
}},
{$group: {
_id: "public",
posts: {
$push: "$userPosts"
}
}},
{$addFields: {
"latestPosts": {
$slice: ["$posts", 5]
}
}},
{$project: {
posts: 0
}}
]);
You'll access the resulting posts like so:
.then(result => {
let latestPosts = result[0].latestPosts;
});
Edit
I built a mock database and came up with this solution:
db.categories.aggregate([{
$project: {
_id: 0,
user: 1,
type: 1,
}
}, {
$match: {
type: "public"
}
}, {
$group: {
_id: {
$concat: ["$user", "-", "$type"]
},
user: {
$first: "$user"
},
type: {
$first: "$type"
}
}
}, {
$sort: {
user: -1
}
}, {
$limit: 10
}, {
$lookup: {
from: 'posts',
localField: 'user',
foreignField: 'user',
as: 'userPosts'
}
}]);
The result will be an array of objects representing each unique user-type combo with the type specified in the $match filter. Each object will contain a child array with all the posts made by that user. All the users will be sorted in descending order and limited to ten.
If all you want is a single operator to get distinct values from categories:
db.categories.aggregate([{$project: {
_id: 0,
user: 1,
type: 1
}}, {$group: {
_id: {$concat: ["$user","-","$type"]},
user: {$first: "$user"},
type: {$first: "$type"}
}}]);
The goal is to first filter out the users with non-public types so they don't appear in the result, join the posts belonging to that user to a new field called "user_posts", and then sort all the users in ascending order. I don't know the entire structure of your database so my specific example may not be the best fit, but it would be a good starting point.
Edit
Some other ways to increase performance for large amounts of data:
By querying in pieces. Say for example you're displaying paginated user posts. Then you would only need to query the posts for the page currently being viewed. You can reduce any noticeable loading times by pre-fetching the next few pages as well.
Using a "virtual" list. This will display every post in one long list, but only fetch a certain size chunk that is currently being viewed, or will soon be viewed. Similar to the first pagination example. If you're familiar with the RecyclerView in Android development, that's the idea.
Restructuring the way posts are stored in your database. It would be a much faster operation if Mongo didn't need to iterate through so much unrelated data when looking for posts of a specific user or type. The solution is to store posts nested below a common ancestor, such as their poster, or their view type. This does mean that you may have duplicate entries spread between documents. This is one of the inherent downsides of a no-SQL database. You will need to find a solution for handling synchronizing the documents spread throughout the database.
Creating a new index on document fields that you use to sort your array. Indices help Mongo organize the documents internally for more efficient searches.
Running a heavy query once, then $out-ing the result. In this way you can store the result of expensive operations in a more permanent fashion than a simple cache (see the sketch after this list).
A variant of number 5 is to $out the result of a large sorting operation, and then continuing to insert new entries at the end of the now sorted collection.
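A sketch of the $out idea, assuming the posts and categories collections from above and an output collection named public_posts_sorted:
// Run the expensive filter + sort once and persist the result with $out.
db.posts.aggregate([
  { $match: { user: { $in: db.categories.distinct('user', { type: 'public' }) } } },
  { $sort: { _id: -1 } },
  { $out: "public_posts_sorted" }
], { allowDiskUse: true })
// Later reads hit the pre-computed collection instead of re-running the aggregation.
db.public_posts_sorted.find().limit(20)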
Finally, I do it like this.
db.post.aggregate([
  {$match:{
    uid:{$in: db.catagory.distinct('id',{type:'public'})}, // directly write here
    _id:{$lt: ObjectId("5c6d5494b9f43f020d4122a0")} // for turning the page
  }},
  {$sort:{_id:-1}}, // sort by post _id
  {$limit:20} // page size
])
It's faster than lookup or others.

Can I use populate before aggregate in mongoose?

I have two models, one is user
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score recorded several times everyday for all users
ScoreSchema = new Schema({
userID: {type: String, ref: 'User'},
score: Number,
created_date: Date,
....
})
I would like to do some query/calculation on the scores for users meeting a specific requirement; say, I would like to calculate the average score, day by day, for all users older than 20.
My thought is that firstly do the populate on Scores to populate user's ages and then do the aggregate after that.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries ( or more accurately an $in query ) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach here is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline operation. Also probably best to handle from the User collection in this case in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foriegnField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userId": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So getting the list of valid users from the one collection to the client and then feeding it into a query on the other collection is the only way to get this to happen in earlier releases.

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
  'id': ObjectId("50ad8d451d41c8fc58000003"),
  'name': 'Sample Log 1',
  'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1','TAG-2'],
    'group_y': ['XYZ']
  }
},
{
  'id': ObjectId("50ad8d451d41c8fc58000004"),
  'name': 'Sample Log 2',
  'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
  'case_id': '50ad8d451d41c8fc58000099',
  'tag_doc': {
    'group_x': ['TAG-1'],
    'group_y': ['XYZ']
  }
}
> db.cases.findOne()
{
  'id': ObjectId("50ad8d451d41c8fc58000099"),
  'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages, but as much as possible I want to immediately limit the number of documents that will pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:'$uploaded_at'}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use-case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for an implementation in mongodb similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging to matching tags or within a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy but this proves to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. It's not possible with $match, only with a single $group pipeline. The trick is to use a compound key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }, { user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create a key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregate without a sort stage, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order, you can add a $sort stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory-limit problems).
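Creating that key as an index in the shell might look like this (using the user_messages collection from the aggregation below):
// Compound index matching the grouping keys, with date_sent descending
// so the latest message per user_id/address comes first.
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 })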
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally such that you only pick out the latest document for each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query for what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
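A sketch of that upsert, assuming the pre-aggregated collection is called latest_posts and reusing example IDs from earlier in the thread:
// Record this post as the latest one for the given author/tag pair.
db.latest_posts.updateOne(
  { author: ObjectId("50ad8d451d41c8fc58000099"), tag: "TAG-1" },
  { $set: { created_at: new Date(), post_id: ObjectId("5152bc79e8bf3bc79a5a1dd8") } },
  { upsert: true }
)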
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields, you would add them along with (or instead of) "Name" and "tag_doc".
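For reference, the index that lets the leading $sort above read index order instead of sorting in memory would be something like:
// Descending index on uploaded_at so the initial $sort can walk the index.
db.logs.createIndex({ uploaded_at: -1 })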