MongoDB pagination with document modified date - mongodb

I am implementing a stream of sorts, which stores data like below:
{_id:ObjectId(), owner:1, text:"some text1", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:01:00+00:00"}
{_id:ObjectId(), owner:1, text:"some text2", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:01:00+00:00"}
{_id:ObjectId(), owner:1, text:"some text3", created:"2015-11-14T01:01:00+00:00", updated:"2015-11-14T01:02:00+00:00"}
...
{_id:ObjectId(), owner:1, text:"some text4", created:"2015-11-14T01:30:00+00:00", updated:"2015-11-14T01:31:00+00:00"}
...
Note that there are frequent bulk updates, which means more than 10-20 documents can be updated at the same time.
My aggregation query looks like this:
db.stream.aggregate([
  { $match: {
      'owner': 1,
      'updated': { $lte: '2015-11-14T01:30:00+00:00' } } },
  { $sort: { 'updated': -1 } },
  { $limit: 10 }
])
If there are more than 10 objects with the same updated date, my pagination will not continue.
For pagination I have the last document of the previous page,
from which I can extract _id and updated.
Is it possible to do something like
$match : {updated :{ $lt: last_page_ele.updated} }
$sort : { updated }
$match : { `documents which comes after the last element in this order` }
$limit : 10
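One common way to express that "documents which come after the last element" condition is to break ties on _id together with updated. This is only a rough sketch of that idea (not an answer from the thread), assuming last_page_ele holds the last document of the previous page:
db.stream.aggregate([
  { $match: {
      owner: 1,
      $or: [
        // strictly older than the last updated value already returned
        { updated: { $lt: last_page_ele.updated } },
        // or the same updated value, with _id as a tie-breaker
        { updated: last_page_ele.updated, _id: { $lt: last_page_ele._id } }
      ]
  } },
  { $sort: { updated: -1, _id: -1 } },
  { $limit: 10 }
])
This assumes every page is sorted by { updated: -1, _id: -1 }, so the (updated, _id) pair of the last element marks the page boundary even when many documents share the same updated value.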
Update:
It is possible to use the method suggested by @Blakes Seven, i.e. to store all the seen documents in a session array and use $nin. This was an obvious solution, but it breaks my front-end behaviour: you cannot open multiple tabs, because they would share the same session space.
Plus, in my case, storing 100s of items in RAM seems impractical and does not scale for a lot of users.

Related

Mongodb selecting every nth of a given sorted aggregation

I want to be able to retrieve every nth item of a given collection, which is quite large (millions of records).
Here is a sample of my collection:
{
_id: ObjectId("614965487d5d1c55794ad324"),
hour: ISODate("2021-09-21T17:21:03.259Z"),
searches: [
ObjectId("614965487d5d1c55794ce670")
]
}
My start of aggregation is like so
[
{
$match: {
searches: {
$in: [ObjectId('614965487d5d1c55794ce670')],
},
},
},
{ $sort: { hour: -1 } },
{ $project: { hour: 1 } },
...
]
I have tried many things, including:
$sample, which does not pick documents in the right order
Using $skip, which becomes very slow as the number given to skip grows
Using _id instead of $skip, but my ids are unfortunately not created in an ordered manner
My goal is thus to retrieve the hour of a record every 20,000 records, so that I can then make a call to retrieve data in chunks of approximately 20,000 records.
I imagine it would be possible to
sort and number every record, then keep only the first, the 20,000th, the 40,000th, ..., and the last.
Thanks for your help and let me know if you need more information
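The "number every record, then keep every 20,000th" idea maps fairly directly onto $setWindowFields with $documentNumber, which requires MongoDB 5.0+. A rough sketch under that assumption, reusing the field names from the sample above (db.collection stands in for the real collection name):
db.collection.aggregate([
  { $match: { searches: { $in: [ObjectId('614965487d5d1c55794ce670')] } } },
  { $setWindowFields: {
      sortBy: { hour: -1 },
      // assign 1, 2, 3, ... in hour-descending order
      output: { rowNumber: { $documentNumber: {} } }
  } },
  // keep rows 1, 20001, 40001, ... i.e. one boundary per ~20,000-record chunk
  { $match: { $expr: { $eq: [ { $mod: ["$rowNumber", 20000] }, 1 ] } } },
  { $project: { hour: 1 } }
])
Unlike repeated $skip calls, this numbers the documents in a single pass; the trade-off is that the whole matched set is still scanned once per call.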

Group by calculated dates in MongoDB

I have data that looks like this:
[
{
"start_time" : ISODate("2017-08-22T19:43:41.442Z"),
"end_time" : ISODate("2017-08-22T19:44:22.150Z")
},
{
"start_time" : ISODate("2017-08-22T19:44:08.344Z"),
"end_time" : ISODate("2017-08-22T19:46:25.500Z")
}
]
Is there any way to run an aggregation query that will give me a frequency result like:
{
ISODate("2017-08-22T19:43:00.000Z"): 1,
ISODate("2017-08-22T19:44:00.000Z"): 2,
ISODate("2017-08-22T19:45:00.000Z"): 1,
ISODate("2017-08-22T19:46:00.000Z"): 1
}
Essentially I want to group by minute, with a sum, but the trick is that each record might count toward multiple groups. Additionally, as in the case with 19:45, the date is not explicitly mentioned in the data (it is calculated as being between two other dates).
At first I thought I could do this with a function like $minute. I could group by minute and check to see if the data fits in that range. However, I'm stuck on how that would be accomplished, if that's possible at all. I'm not sure how to turn a single entry into multiple date groups.
You can use the aggregation below in MongoDB 3.6.
db.col.aggregate([
{"$addFields":{"date":["$start_time","$end_time"]}},
{"$unwind":"$date"},
{"$group":{
"_id":{
"year":{"$year":"$date"},
"month":{"$month":"$date"},
"day":{"$dayOfMonth":"$date"},
"hour":{"$hour":"$date"},
"minute":{"$minute":"$date"}
},
"count":{"$sum":1}
}},
{"$addFields":{
"_id":{
"$dateFromParts":{
"year":"$_id.year",
"month":"$_id.month",
"day":"$_id.day",
"minute":"$_id.minute"
}
}
}}
])
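The pipeline above only counts the two endpoint minutes, so an in-between minute such as 19:45 would not appear. To cover the "each record might count toward multiple groups" requirement, one option is to expand every document into the full list of minutes it spans before grouping. This is only a rough sketch of that idea (not the answer above), using operators available since MongoDB 3.4 ($range, $map); field names follow the sample data:
db.col.aggregate([
  {"$addFields":{
    // start_time / end_time truncated to the minute (strip seconds and milliseconds)
    "minuteStart":{"$subtract":["$start_time",
      {"$add":[{"$multiply":[{"$second":"$start_time"},1000]},{"$millisecond":"$start_time"}]}]},
    "minuteEnd":{"$subtract":["$end_time",
      {"$add":[{"$multiply":[{"$second":"$end_time"},1000]},{"$millisecond":"$end_time"}]}]}
  }},
  {"$addFields":{
    // one Date per minute in [minuteStart, minuteEnd]
    "minutes":{"$map":{
      "input":{"$range":[0,
        {"$add":[{"$floor":{"$divide":[{"$subtract":["$minuteEnd","$minuteStart"]},60000]}},1]}]},
      "as":"i",
      "in":{"$add":["$minuteStart",{"$multiply":["$$i",60000]}]}
    }}
  }},
  {"$unwind":"$minutes"},
  {"$group":{"_id":"$minutes","count":{"$sum":1}}},
  {"$sort":{"_id":1}}
])
For the two sample documents this yields counts of 1, 2, 1 and 1 for the 19:43 through 19:46 minute buckets, matching the desired output.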

MongoDB index array issue

I have a question about MongoDB (version 3.4.10) indexes on arrays, as I see they don't work correctly. Maybe I am doing something wrong.
I have a schedule document that has some props that aren't important for this question, but each schedule has its plan (occurrence) periods in an array.
{ .... "plans": [ { "startDateTime": "2018-01-04T00:00:00Z",
"endDateTime": "2018-01-04T23:59:59Z" }, { "startDateTime":
"2018-01-11T00:00:00Z", "endDateTime": "2018-01-11T23:59:59Z" } ... ]
},
Now I need to search schedule documents by that array item and find all schedules that fit in that period.
I have created indexes on plans.startDateTime and plans.endDateTime.
When I run the following query and inspect it with the Compass "Explain Plan" option, I get really bad results.
{"Plans.StartDateTime": {$lt: new Date ('2018-01-10')}, "Plans.EndDateTime": {$gte: new Date ('2018-01-15')} }
The results are as follows (this is in a test environment where the number of documents is really low; in production the ratio would be even higher):
Documents Returned:2823
Index Keys Examined:65708
Documents Examined:11554
When I go a little bit deeper in the analysis I get the following (meaning that Mongo is ignoring the plan end date in the index search):
"indexBounds": {
  "Plans.StartDateTime": [ "(true, new Date(1515542400000))" ],
  "Plans.EndDateTime": [ "[MinKey, MaxKey]" ]
},
Can somebody please tell me how to create better indexes for this search, because the current one isn't working?
In order to find all scheduleDocuments having at least one plan overlapping with a given time interval (e.g. 2018-01-10 and 2018-01-14), you have to use the $elemMatch MongoDB operator.
db.scheduleDocuments.find({
plans: {
$elemMatch: {
startDateTime: { $lte: ISODate("2018-01-14Z") },
endDateTime: { $gt: ISODate("2018-01-10Z") }
}
}
});
The rule used to test for overlapping intervals can be found here.
This search performs a collection scan, unless you create an index on the array.
db.scheduleDocuments.createIndex({
"plans.startDateTime": 1,
"plans.endDateTime": 1
});
Thanks to the index, non-matching documents in the collection are not scanned at all. An IXSCAN is performed and only matching documents are accessed to be fetched and returned.
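A quick way to check this (a hypothetical shell session, not output from the original thread) is to re-run the query with explain once the index exists:
db.scheduleDocuments.find({
  plans: {
    $elemMatch: {
      startDateTime: { $lte: ISODate("2018-01-14Z") },
      endDateTime: { $gt: ISODate("2018-01-10Z") }
    }
  }
}).explain("executionStats")
// the winningPlan should now contain an IXSCAN over
// plans.startDateTime_1_plans.endDateTime_1, and totalDocsExamined
// should be close to nReturned rather than several times larger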

mongodb statistics 5 million doc too slow

5 million mongo docs:
{
_id: xxx,
devID: 123,
logLevel: 5,
logTime: 1468464358697
}
indexes:
devID
my aggregate:
[
{$match: {devID: 123}},
{$group: {_id: {level: "$logLevel"}, count: {$sum: 1}}}
]
aggregate result:
{ "_id" : { "level" : 5 }, "count" : 5175872 }
{ "_id" : { "level" : 1 }, "count" : 200000 }
aggregate explain:
numYields:42305
29399ms
Q:
If mongo is not writing (saving) data, it takes 29 seconds.
If mongo is writing (saving) data, it takes 2 minutes.
My aggregate result needs to be returned to the web, so 29 sec or 2 min is too long.
How can I solve it? Preferably 10 seconds or less.
Thanks all.
In your example, the aggregation query for {devID: 123, logLevel:5} returns a count of 5,175,872 which looks like it counted all the documents in your collection (since you mentioned you have 5 million documents).
In this particular example, I'm guessing that the {$match: {devID: 123}} stage matches pretty much every document, hence the aggregation is doing what is essentially a collection scan. Depending on your RAM size, this could have the effect of pushing your working set out of memory, and slow down every other query your server is doing.
If you cannot provide a more selective criteria for the $match stage (e.g. by using a range of logTime as well as devID), then a pre-aggregated report may be your best option.
In general terms, a pre-aggregated report is a document that contains the aggregated information you require, and you update this document every time you insert into the related collection. For example, you could have a single document in a separate collection that looks like:
{log:
{devID: 123,
levelCount: [
{level: 5, count: 5175872},
{level: 1, count: 200000}
]
}}
where that document is updated with the relevant details every time you insert into the log collection.
Using a pre-aggregated report, you don't need to run the aggregation query anymore. The aggregated information you require is available using a single find() query instead.
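As a sketch only (the collection name logStats is illustrative, and the shape is flattened to one document per devID/logLevel rather than the nested array shown above, because a flat document is easier to upsert atomically), the upkeep on every log insert could look like:
// run alongside each insert into the log collection
db.logStats.updateOne(
  { devID: 123, level: 5 },
  { $inc: { count: 1 } },
  { upsert: true }
)
// the web request then reads the report with a cheap find()
db.logStats.find({ devID: 123 })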
For more examples on pre-aggregated reports, please see https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports/

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
'id': ObjectId("50ad8d451d41c8fc58000003")
'name': 'Sample Log 1',
'uploaded_at: ISODate("2013-03-14T01:00:00+01:00"),
'case_id: '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x: ['TAG-1','TAG-2'],
'group_y': ['XYZ']
}
},
{
'id': ObjectId("50ad8d451d41c8fc58000004")
'name': 'Sample Log 2',
'uploaded_at: ISODate("2013-03-15T01:00:00+01:00"),
'case_id: '50ad8d451d41c8fc58000099'
'tag_doc': {
'group_x: ['TAG-1'],
'group_y': ['XYZ']
}
}
> db.cases.findOne()
{
'id': ObjectId("50ad8d451d41c8fc58000099")
'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group pipelines, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:{uploaded_at:1}}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group pipelines, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering if the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use case I tried to use a simple analogy, but this proved to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar, but it's not possible with $match, only with one $group pipeline. The trick is to use a compound index with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }, { user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id & address and I want the message with the latest date, we need to create an index key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregate without a sort stage, which is much faster and will work on shards with replicas. If you don't have an index with the correct sort order you can add a sort pipeline stage, but then you can't use it with shards, because everything is transferred to the mongos and the grouping is done there (you will also run into memory limit problems).
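For completeness, an index with that key shape could be created like this (collection and field names are taken from the example documents above):
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 })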
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this - but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly, by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
  {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
  { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
  {upsert:true}
)
Hmmm, there is no good way of doing this optimally in such a manner that you only need to pick out the latest for each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said, this is not optimal; however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection, as sketched below. This would be more optimal.
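A hypothetical upsert for that maintenance step (the collection name authorTagPosts and the newPost variable are illustrative, not from the thread), run once per tag of the newly created post:
db.authorTagPosts.update(
  { author: newPost.author_id, tag: tag },
  { $set: { created_at: newPost.created_at, post_id: newPost._id } },
  { upsert: true }
)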
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "Name" and "tag_doc".
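For reference, the index that paragraph assumes could be created like this (a descending index matches the $sort direction, though an ascending one can also be walked in reverse):
db.logs.createIndex({ uploaded_at: -1 })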