Best way to do this in MongoDB?

db.post
{user:'50001', content:...,},
{user:'50002', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
{user:'50002', content:...,},
db.catagory
{user:'50001', type:'public',...}
{user:'50002', type:'vip',...}
{user:'50003', type:'public',...}
What I want is to pick up the posts of users whose type is 'public'.
I can do it like this:
publicUsers = db.catagory.distinct('user', {type:'public'})
db.post.find({user: {$in: publicUsers}}).sort({_id:-1})
or use $lookup in an aggregation,
and get this output:
{user:'50001', content:...,},
{user:'50003', content:...,},
{user:'50001', content:...,},
Is there some other, better way to do this faster?
Considering the situation with a large number of requests, should I create a new collection like post_public just for this query?

The aggregation pipeline is your best bet. It's quite performant, as it's intended for just these types of situations.
Another Edit
Ok, this is the best solution I have, but it has some limitations set by Mongo.
Firstly, a sort operation on non-indexed fields cannot use more than 100MB of RAM. That's quite a bit of space for text-based posts, but something to be aware of (there's a workaround that uses disk space if needed, but it will slow down the operation).
Second, and more relevant in this case, Mongo sets a hard cap of 16MB on the size of a single document. Given that you wanted the posts in a single array, a large dataset can potentially exceed that if no limits on the output are imposed. If you decide you don't want all posts in a single array, remove the last three stages of the pipeline ($group, $addFields, and $project); the result will then be the sorted posts, each within its own document (a trimmed sketch follows the pipeline below). You will need to limit the number of documents manually to get the n posts you want if you go that route. IMO, you should do this. Those last three stages add additional processing burden but offer very little practical benefit on the output. The only difference is iterating over a nested array versus iterating over the root result array.
db.categories.aggregate([
    {$project: {
        user: 1,
        type: 1
    }},
    {$match: {
        type: "public"
    }},
    {$lookup: {
        from: 'posts',
        localField: 'user',
        foreignField: 'user',
        as: 'userPosts'
    }},
    {$unwind: {
        path: "$userPosts",
        preserveNullAndEmptyArrays: false
    }},
    {$sort: {
        "userPosts.posted": -1
    }},
    {$group: {
        _id: "public",
        posts: {
            $push: "$userPosts"
        }
    }},
    {$addFields: {
        "latestPosts": {
            $slice: ["$posts", 5]
        }
    }},
    {$project: {
        posts: 0
    }}
]);
You'll access the resulting posts like so:
.then(result => {
let latestPosts = result[0].latestPosts;
});
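If you drop those last three stages as described above and cap the output manually, a trimmed pipeline might look like this (a sketch only; the initial $project is omitted for brevity, and the 5 mirrors the $slice above):
db.categories.aggregate([
    {$match: { type: "public" }},
    {$lookup: { from: 'posts', localField: 'user', foreignField: 'user', as: 'userPosts' }},
    {$unwind: { path: "$userPosts", preserveNullAndEmptyArrays: false }},
    {$sort: { "userPosts.posted": -1 }},
    {$limit: 5}   // manual cap instead of the $group/$slice stages
]);
Each result document is then one category document carrying a single userPosts entry, already in sorted order.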
Edit
I built a mock database and came up with this solution:
db.categories.aggregate([{
    $project: {
        _id: 0,
        user: 1,
        type: 1
    }
}, {
    $match: {
        type: "public"
    }
}, {
    $group: {
        _id: {
            $concat: ["$user", "-", "$type"]
        },
        user: {
            $first: "$user"
        },
        type: {
            $first: "$type"
        }
    }
}, {
    $sort: {
        user: -1
    }
}, {
    $limit: 10
}, {
    $lookup: {
        from: 'posts',
        localField: 'user',
        foreignField: 'user',
        as: 'userPosts'
    }
}]);
The result will be an array of objects representing each unique user-type combo with the type specified in the $match filter. Each object will contain a child array with all the posts made by that user. All the users will be sorted in descending order and limited to ten.
If all you want is a single operator to get distinct values from categories:
db.categories.aggregate([{
    $project: {
        _id: 0,
        user: 1,
        type: 1
    }
}, {
    $group: {
        _id: {$concat: ["$user", "-", "$type"]},
        user: {$first: "$user"},
        type: {$first: "$type"}
    }
}]);
The goal is to first filter out the users with non-public types so they don't appear in the result, join the posts belonging to each remaining user into a new field called "userPosts", and then sort all the users in descending order. I don't know the entire structure of your database, so my specific example may not be the best fit, but it would be a good starting point.
Edit
Some other ways to increase performance for large amounts of data:
1. Query in pieces. Say, for example, you're displaying paginated user posts. Then you would only need to query the posts for the page currently being viewed. You can reduce any noticeable loading times by pre-fetching the next few pages as well.
2. Use a "virtual" list. This will display every post in one long list, but only fetch a certain-size chunk that is currently being viewed, or will soon be viewed. Similar to the first pagination example. If you're familiar with the RecyclerView in Android development, that's the idea.
3. Restructure the way posts are stored in your database. It would be a much faster operation if Mongo didn't need to iterate through so much unrelated data when looking for posts of a specific user or type. The solution is to store posts nested below a common ancestor, such as their poster or their view type. This does mean that you may have duplicate entries spread between documents; that is one of the inherent downsides of a NoSQL database, and you will need a way to keep the copies spread throughout the database in sync.
4. Create a new index on the document fields you use to sort your array. Indexes help Mongo organize documents internally for more efficient searches.
5. Run a heavy query once, then $out the result. In this way you can store the result of expensive operations in a more permanent fashion than a simple cache (see the sketch after this list).
6. A variant of number 5: $out the result of a large sorting operation, then keep inserting new entries at the end of the now-sorted collection.
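For number 5, a minimal sketch under assumed names (the posts_public_sorted collection is hypothetical, and $replaceRoot needs MongoDB 3.4+):
// Run the expensive query once and materialize it with $out
db.categories.aggregate([
    { $match: { type: "public" } },
    { $lookup: { from: 'posts', localField: 'user', foreignField: 'user', as: 'userPosts' } },
    { $unwind: '$userPosts' },
    { $replaceRoot: { newRoot: '$userPosts' } },
    { $sort: { _id: -1 } },
    { $out: 'posts_public_sorted' }
], { allowDiskUse: true });

// Later reads hit the pre-built collection cheaply
db.posts_public_sorted.find().sort({ _id: -1 }).limit(20);
Number 6 then just keeps appending new public posts to that materialized collection as they arrive.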

Finally, I do it like this.
db.post.aggregate([
    {$match: {
        uid: {$in: db.catagory.distinct('id', {type: 'public'})},  // build the list of public users inline
        _id: {$lt: ObjectId("5c6d5494b9f43f020d4122a0")}           // for turning the page
    }},
    {$sort: {_id: -1}},  // sort by post _id
    {$limit: 20}         // page size
])
It's faster than $lookup or the other approaches.

Related

Can you have a collection that's randomly distributed in mongodb?

I have a collection that's essentially just a collection of unique IDs, and I want to store them randomly distributed so I can just call findOne instead of using $sample, since that's quicker than an aggregation.
I ran the following aggregation to sort it randomly:
db.my_coll.aggregate([{"$sample": {"size": 1200000}}, {"$out": {db: "db", coll: "my_coll"}}], {allowDiskUse: true})
it seems to work?
db.my_coll.find():
{ _id: 581848, schema_version: 1 },
{ _id: 1184557, schema_version: 1 },
{ _id: 213688, schema_version: 1 },
....
Is this allowed? I thought _id is a default index, and it should always be sorted by the index. I'm only ever removing elements from this collection, so it's fine if they don't get inserted randomly, but I don't know if this is just a hack that might at some point behave differently.

MongoDB Event Driven Database Design

Goal
Zero Conflict System: Having this be a write-only system would save us from conflicts. People are creating and updating documents both offline and online, and being able to figure out which update trumps which is important.
Deep Historical Reference: I want to know, at any point in time, what a document looked like. On top of that, I need a deep historical analysis of how each item changes over time.
I was thinking of the following architecture:
Reference Document
{
    _id: "u12345",
    type: "user",
    createdAt: 1584450565 // UNIX TIMESTAMP
}
{
    _id: "<random>",
    type: "user-name-revision", // {type}-{key}-Revision
    referenceId: "u12345",
    value: "John Doe Boy",
    updatedAt: 1584450565
}
{
    _id: "<random>",
    type: "user-name-revision",
    referenceId: "u12345",
    value: "John Doe",
    updatedAt: 1584450566 // 1 second higher than the above
}
{
    _id: "<random>",
    type: "user-email-revision",
    referenceId: "u12345",
    value: "john@gmail.com",
    updatedAt: 1584450565
}
If you want to get the user, you would:
Get all documents with referenceId of u12345.
Only get the most recent of each type
Then combine and output the user like so:
_id: "u12345",
type: "user",
createdAt: 1584450565,
name: "John Doe"
email: "john#gmail.com"
updatedAt: 1584450566 // highest timestamp
The only issue I see is sorting, say, all users by name. If I have 1,000 users, I don't see a clean way of doing this.
I was wondering if anyone had any suggestions for a pattern I could use. I'm using MongoDB so I have the power of that at my disposal.
You can try the aggregation below.
Project a key field out of the type field, sort by updatedAt, and group to pick the latest value while keeping the referenceId and updatedAt.
Group all documents, merge the different key values, keep the updatedAt, and post-process to format the document.
$lookup to pull in the user document, followed by $replaceRoot to merge the main document with the looked-up document.
Sort the documents by name.
db.collectionname.aggregate([
{"$addFields":{"key":{"$arrayElemAt":[{"$split":["$type","-"]},1]}}},
{"$sort":{"updatedAt":-1}},
{"$group":{
"_id":{"referenceId":"$referenceId","key:"$key"},
"value":{"$first":"$$ROOT"},
"referenceId":{"$first":"$referenceId"},
"updatedAt":{"$first":"$updatedAt"}
}},
{"$sort":{"updatedAt":-1}},
{"$group":{
"_id":"$_id.referenceId",
"data":{
"$mergeObjects":{"$arrayToObject":[[["$_id.key","$value"]]]}
},
"updatedAt":{"$first":"$updatedAt"}
}},
{"$addFields":{
"data.referenceId":"$referenceId",
"data.updatedAt":"$updatedAt"
}},
{"$project":{"data":1}},
{"$lookup":{
"from":"othercollectionname",
"localField":"data.referenceId",
"foreignField":"_id",
"as":"reference"
}},
{"$replaceRoot":{
"newRoot":{
"$mergeObjects":[{"$arrayElemAt":["$reference",0]},"$data"]}
}},
{"$project":{"_id":0}},
{"$sort":{"name":1}}
])
Alternate approach:
With all of these transformations your query will be a little slower. You can make a few tweaks.
Input
{
    _id: "<random>",
    type: "user",
    key: "name",
    referenceId: "u12345",
    value: "John Doe Boy",
    updatedAt: 1584450565
}
Query
db.collectionname.aggregate([
{"$sort":{"updatedAt":-1}},
{"$group":{
"_id":{"referenceId":"$referenceId","key":"$key"},
"top":{"$first":"$$ROOT"}
}},
{"$sort":{"top.updatedAt":-1}},
{"$group":{
"_id":"$_id.referenceId",
"max":{"$max":{"$cond":[{"$eq":["$key", "name"]},"$top.value",null]}},
"key-values":{"$push":{"k":"$_id.key","v":"$top.value"}},
"updatedAt":{"$first":"$top.updatedAt"}
}},
{"$lookup":{
"from":"othercollectionname",
"localField":"_id",
"foreignField":"_id",
"as":"reference"
}},
{"$project":{"_id":0}},
{"$sort":{"max":1}}
])
We can refine our schema further to remove a few other stages, making sure we add the latest value at the end of the array. Something like:
Input
{
    _id: "<random>",
    type: "user",
    key: "name",
    referenceId: "u12345",
    updates: [
        {"value": "John Doe Boy", updatedAt: 1584450565},
        {"value": "John Doe", updatedAt: 1584450566}
    ]
}
Query
db.collectionname.aggregate([
{"$addFields":{"latest":{"$arrayElemAt":["$updates",-1]}}},
{"$group":{
"_id":"$referenceId",
"max":{"$max":{"$cond":[{"$eq":["$key", "name"]},"$latest.value",null]}},
"updatedAt":{"$first":"$updatedAt"}
"key-values":{"$push":{"k":"$key","v":"$latest.value"}},
"updatedAt":{"$first":"$latest.updatedAt"}
}},
{"$lookup":{
"from":"othercollectionname",
"localField":"_id",
"foreignField":"_id",
"as":"reference"
}},
{"$project":{"_id":0}},
{"$sort":{"max":1}}
])
Your question does not have enough requirements for a specific answer, so I'll try to give an answer that should cover many cases.
I doubt you'll find detailed published use cases, however, I can give you a few tips from my personal experience.
High throughput:
If you are using high-throughput event streaming, it would be better to store your data in an event log, where IDs are not unique and there are no updates, only inserts. This could be done, for instance, with Kafka, which is meant to be used for event streaming. You could then process the events in bulk into a searchable database, e.g. MongoDB.
Low throughput:
For a lower throughput, you could insert documents directly into MongoDB, however, still only insert, not update data.
Storing data in a event-log style in MongoDB:
In both cases, within MongoDB, you'll want a random _id (e.g. UUID), so each event has a unique _id. To access a logical document, you'll need another field, e.g. docId, which along with eventTimestamp will be indexed (with eventTimestamp sorted desc for faster access to latest version).
Searching:
To search by other fields, you can use additional indexes as necessary; however, if your searches take significant CPU time, make sure you only run them against secondary instances of MongoDB (a secondary read preference), so that the event inserts won't get delayed. Make yourself familiar with MongoDB's aggregation pipeline.
To prevent invalid states due to out-of-order updates:
Since you want to enable updates, you should consider only saving the changes in each document, e.g. +1 to field A, set value to x for field B. In this case you will need an index with docId and ascending eventTimestamp instead, and every now and then aggregate your events into summary documents in a different collection, to enable faster reading of the latest state. Use the eventTimestamp of the latest document per docId for the aggregated document, plus an aggregationTimestamp and versionCount. If at any point you receive a document with an eventTimestamp lower than the latest eventTimestamp in the aggregated collection, you'll need to partially recalculate that collection. In other cases, you can update the aggregated collection incrementally.
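A minimal sketch of the indexing and the "latest event per logical document" read described above, using assumed names (an events collection with docId and eventTimestamp fields):
// Index for fast access to the latest version of each logical document
db.events.createIndex({ docId: 1, eventTimestamp: -1 });

// Latest event per docId (requires MongoDB 3.4+ for $replaceRoot)
db.events.aggregate([
    { $sort: { docId: 1, eventTimestamp: -1 } },
    { $group: { _id: "$docId", latest: { $first: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$latest" } }
]);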
Use this and you will get the desired output; make sure you have an index on referenceId and updatedAt, and enough memory to sort.
db.columnName.aggregate([
    {
        $match: {
            referenceId: "u12345"
        }
    },
    {
        $project: {
            type: { $arrayElemAt: [ { $split: [ "$type", "-" ] }, 0 ] },
            referenceId: true,
            createdAt: true,
            name: true,
            email: true,
            updatedAt: true
        }
    },
    {
        $sort: {
            updatedAt: -1
        }
    },
    {
        $group: {
            _id: "$referenceId",
            type: {
                $first: "$type"
            },
            createdAt: {
                $last: "$updatedAt"
            },
            name: {
                $first: "$name"
            },
            email: {
                $first: "$email"
            },
            updatedAt: {
                $first: "$updatedAt"
            }
        }
    }
])

MongoDB: update nested value in a collection based on existing field value

I want to update nested _ids over an entire collection IF they are of type string.
If I have objects that look like this...
user : {
_id: ObjectId('234wer234wer234wer'),
occupation: 'Reader',
books_read: [
{
title: "Best book ever",
_id: "123qwe234wer345ert456rty"
},
{
title: "Worst book ever",
_id: "223qwe234wer345ert456rty"
},
{
title: "A Tail of Two Cities",
_id: ObjectId("323qwe234wer345ert456rty")
}
]
}
and I want to change the type of those _ids from string to ObjectId,
how would I do that?
I have done "this" in the past... but that works on a NON-nested item; I need to change a nested value:
db.getCollection('users')
.find({
$or: [
{occupation:{$exists:false}},
{occupation:{$eq:null}}
]
})
.forEach(function (record) {
record.occupation = 'Reader';
db.users.save(record);
});
Any help? I am trying to avoid writing a series of loops on the app server to make db calls, so I am hoping for something directly in 'mongo'.
There isn't a way of doing (non-$rename) update operations on a document while referencing existing fields; see MongoDB: Updating documents using data from the same document.
So, you'll need to write a script (similar to the one you posted with find & forEach) to recreate those documents with the correct _id type. To find the subdocuments to update you can use the $type operator. A query like db.coll.find({'nestedField._id': {$type: 'string' }}) should find all the full documents that have bad subdocuments, or you could do an aggregation query with $match & $unwind to get only the subdocuments:
db.coll.aggregate([
{ $match: {'nestedField._id': {$type: 'string' }}}, // limiting to documents that have any bad subdocuments
{ $unwind: '$nestedField'}, // creating a separate document in the pipeline for each entry in the array
{ $match: {'nestedField._id': {$type: 'string' }}}, // limiting to only the subdocuments that have bad fields
{ $project: { nestedId: '$nestedField._id' }} // output will be: {_id: documentId, nestedId }
])
I am trying to avoid writing a series of loop on the app server to make db calls - so I am hoping for something directly in 'mongo'
You can run JS code directly in the mongo shell to avoid making API calls, but I don't think there's any way to avoid looping over the documents.
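A sketch of such a script, using the books_read array from the question (it assumes the stored strings are valid 24-character hex values; ObjectId() will throw otherwise):
db.getCollection('users')
    .find({ 'books_read._id': { $type: 'string' } })
    .forEach(function (user) {
        user.books_read = user.books_read.map(function (book) {
            if (typeof book._id === 'string') {
                book._id = ObjectId(book._id); // convert only the string _ids
            }
            return book;
        });
        db.users.updateOne({ _id: user._id }, { $set: { books_read: user.books_read } });
    });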

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
    'id': ObjectId("50ad8d451d41c8fc58000003"),
    'name': 'Sample Log 1',
    'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1','TAG-2'],
        'group_y': ['XYZ']
    }
},
{
    'id': ObjectId("50ad8d451d41c8fc58000004"),
    'name': 'Sample Log 2',
    'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1'],
        'group_y': ['XYZ']
    }
}
> db.cases.findOne()
{
    'id': ObjectId("50ad8d451d41c8fc58000099"),
    'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:{uploaded_at:1}}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions very much appreciated.
Edit:
I am actually looking for an implementation in mongodb similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging to matching tags or within a range of dates.
Edit:
Due to the complexity of my use case I tried to use a simple analogy, but this proved to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. But it's not possible with $match, only with one $group pipeline. The trick is to use a compound key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id & address and I want the message with the latest date, we need to create a key like this:
{ user_id: 1, address: 1, date_sent: -1 }
Then you are able to perform the aggregate without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order you can add a sort pipeline stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
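For reference, such a key would be created like this (a sketch; older shells use ensureIndex instead of createIndex):
// Compound index: the grouping fields plus date_sent descending for "latest first"
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 });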
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally such that you only need to pick out the latest of each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
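A rough sketch of that upsert step, with an assumed collection name (authorTagLatest) and a post variable holding the newly created blog post:
post.tag_doc.tags.forEach(function (tag) {
    db.authorTagLatest.update(
        { author: post.author_id, tag: tag },                          // one document per author/tag combo
        { $set: { created_at: post.created_at, post_id: post._id } },
        { upsert: true }
    );
});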
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "name" and "tag_doc".
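For instance, the supporting index might be created like this (a sketch; a single-field index can be walked in either direction, so {uploaded_at: 1} works just as well):
db.logs.createIndex({ uploaded_at: -1 });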

Combine two fields from different documents in Mongodb

I have these documents in a collection :
{topic : "a",
messages : [ObjectId("21312321321323"),ObjectId("34535345353"),...]
},
{topic : "b,
messages : [ObjectId("1233232323232"),ObjectId("6556565656565"),...]
}
Is there a possibility to get a result with the combination of the messages fields? I'd like to get this, for example:
{[
ObjectId(""),ObjectId(""),ObjectId(""),ObjectId("")
]}
I thought that this was possible with MapReduce, but in my case the documents don't have anything in common. Right now I'm doing this in the backend using JavaScript and loops, but I think that this isn't the best option. Thanks.
You could use the $group operator in the Aggregation Framework. To use the Aggregation Framework you will want to be sure you're running on MongoDB 2.2 or newer, of course.
If used with $push you will get all the lists of messages concatenated together.
db.myCollection.aggregate({ $group: { _id: null, messages: { $push: '$messages' } } });
If used with $addToSet you will get only the distinct values.
db.myCollection.aggregate({ $group: { _id: null, messages: { $addToSet: '$messages' } } });
And if you want to filter down the candidate documents first, you can use $match.
db.myCollection.aggregate([
{ $match: { topic: { $in: [ 'a', 'b' ] } } },
{ $group: { _id: null, matches: { $sum: 1 }, messages: { $push: '$messages' } } }
]);
One option is to use the aggregation framework.
However, if you're planning on having a large number of results (beyond just a "lightweight" result), a result document exceeding 16MB in size, or using excessive system memory, you'll need to just loop through the objects in the collection and concatenate the results manually (as you suggest you might be doing now) or risk mongodb throwing an exception.
Aggregation limits may be found at the bottom of this page:
http://docs.mongodb.org/manual/applications/aggregation/
Given the limitations, you may want to just use find with a projection to return just messages.
(And with anything like this, I'd strongly recommend you run some performance benchmarks comparing the options with your data on your servers; the "Internet" currently suggests that some people have found the aggregation support to be slower than other techniques.)
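For reference, the find-with-projection approach mentioned above might look like this (a sketch; add a query filter as needed):
db.myCollection.find({}, { _id: 0, messages: 1 });
You would then concatenate the returned arrays in application code, as you're doing now.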