MongoDB get every second result - mongodb

In my MongoDB collection every document has a date field that holds a timestamp.
There is a lot of data, and I want to get only part of it, one document for every interval,
e.g. 400 ms:
1402093316030<----
1402093316123
1402093316223
1402093316400<----
1402093316520
1402093316630
1402093316824<----
Is it possible to get every other, or every third, result?
Or, better, the first document in every 400 ms interval?

You can do this with the aggregation framework and a little date math. Let's say you have a "timestamp" field and additional fields "a", "b" and "c":
db.collection.aggregate([
{ "$group": {
"_id": {
"$subtract": [
"$timestamp",
{ "$mod": [ "$timestamp", 400 ] }
]
},
"timestamp": { "$first": "$timestamp" },
"a": { "$first": "$a" },
"b": { "$first": "$b" },
"c": { "$first": "$c" }
}}
])
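So the date math there "groups" the values of the "timestamp" field into 400 ms intervals: subtracting the remainder of a division by 400 rounds each value down to the start of its interval. The rest of the data is picked with the $first operator, which takes the "first" value of the field found within those grouping boundaries. Since $first depends on the order in which documents arrive, add a $sort on "timestamp" before the $group if that order is not otherwise guaranteed.
If you instead want the "last" item within those boundaries, switch to the $last operator.
The end result is the first document that occurred in every 400 millisecond interval.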
See the aggregate command and the Aggregation Framework operators for additional reference.
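To see which documents end up in the same interval, the bucket arithmetic can be reproduced outside the database. The following is a plain JavaScript sketch, not part of the pipeline: `bucketStart` and `firstPerBucket` are hypothetical helper names, the timestamps are the ones from the question, and the input is assumed to be already ordered by timestamp.

```javascript
// Round a millisecond timestamp down to the start of its interval,
// mirroring { $subtract: [ ts, { $mod: [ ts, 400 ] } ] } in the pipeline.
function bucketStart(ts, intervalMs) {
  return ts - (ts % intervalMs);
}

// Keep the first timestamp seen in each bucket, like $first inside $group.
function firstPerBucket(timestamps, intervalMs) {
  const firstSeen = new Map();
  for (const ts of timestamps) {
    const key = bucketStart(ts, intervalMs);
    if (!firstSeen.has(key)) {
      firstSeen.set(key, ts);
    }
  }
  return [...firstSeen.values()];
}

// Timestamps taken from the question, in ascending order.
const sampleTimestamps = [
  1402093316030, 1402093316123, 1402093316223,
  1402093316400, 1402093316520, 1402093316630,
  1402093316824
];

console.log(firstPerBucket(sampleTimestamps, 400));
// → [ 1402093316030, 1402093316400, 1402093316824 ]
```

Note that the buckets are aligned to multiples of 400 ms counted from the epoch, not to the timestamp of the first document.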

Related

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the maximum ts_utc for each value of the first (numeric) part.
That way you can do it in one shot, instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field that indicates the grouping. You can then index that auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
Here is the Mongo playground for your reference.
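For checking results on a small sample, the same "latest document per group" logic can be mirrored in plain JavaScript. This is only a sketch under assumptions: `groupKey` and `latestPerGroup` are hypothetical names, and `ts_utc` is assumed to be stored as ISO-8601 strings exactly as in the sample data (with real BSON dates you would compare the dates instead).

```javascript
// Sample documents from the question; ts_utc is an ISO-8601 UTC string.
const sampleDocs = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" }
];

// Group key = the part of _id before the first dot,
// like $split followed by $arrayElemAt with index 0.
function groupKey(id) {
  return id.split(".")[0];
}

// Keep the document with the greatest ts_utc in each group.
function latestPerGroup(documents) {
  const latest = new Map();
  for (const doc of documents) {
    const key = groupKey(doc._id);
    const current = latest.get(key);
    // ISO-8601 UTC strings in one fixed format sort chronologically as strings.
    if (current === undefined || doc.ts_utc > current.ts_utc) {
      latest.set(key, doc);
    }
  }
  return latest;
}

const latestDocs = latestPerGroup(sampleDocs);
console.log(latestDocs.get("42")._id); // → 42.def
console.log(latestDocs.get("69")._id); // → 69.def
```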

MongoDB slow aggregate time

I'm facing an issue where an aggregate query performs very slowly, taking about 30 seconds to gather all my data. Assume each record has this structure:
{
"_id":{
"$oid":"5909a5cefece40f172895a6b"
},
"Record":1,
"Link":"https://www.google.com",
"Location":["loc1", "loc2", "loc3"],
"Organization":["org1", "org2", "org3"],
"Date":2017,
"PeoplePPL":["ppl1", "ppl2", "ppl3"]
}
And the aggregate query as follows:
db.testdata_4.aggregate([{
"$unwind": "$PeoplePPL"
},{
"$unwind": "$Location"
},{
"$match": {
Date: {
$gte: lowerBoundYear,
$lte: upperBoundYear
}
}
},{
"$group": {
"_id": {
"People": "$PeoplePPL",
"Date": "$Date"
},
Links: {
$addToSet: "$Link"
},
Locations: {
$addToSet: "$Location"
}
}
},{
"$group": {
"_id": "$_id.People",
Record: {
$push: {
"Country": "$Locations",
"Year": "$_id.Date",
"Links": "$Links"
}
}
}
}]).toArray()
There are a total of 154 records in the "testdata_4" collection, and upon aggregation, there will be 5571 records returned with the query time of 28 seconds. I have performed the ensureIndex() on "Locations" and "Date". Is this supposed to be normal as the number of records returned increases?
If it isn't normal, may I know if there's a workaround to decrease my query time to 5 seconds at max instead of having it at 28 seconds or more?
It's very likely that the index on Date isn't being used.
The $match and $sort operators can take advantage of indexes when they are used at the beginning of the pipeline. In this case the filter is applied after several $unwind stages, which means the index likely isn't used.
Suggestions:
Move the $match stage to the beginning of the pipeline
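In the sample document only "PeoplePPL", "Location" and "Organization" are arrays; "Date" and "Link" are scalars and need no $unwind. Each $unwind multiplies the number of intermediate documents (three people times three locations gives nine per record), so keep only the $unwind stages whose fields you actually group on.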
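The effect of stage order on the amount of intermediate data can be illustrated in plain JavaScript. This is a rough sketch, not MongoDB code: `unwind` and `matchYear` are hypothetical stand-ins for the $unwind and $match stages, the second record and the year bounds are invented for the illustration, and it says nothing about index usage, only about how many documents flow through the pipeline.

```javascript
// Stand-in for $unwind: one output document per element of the array field.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    doc[field].map(value => ({ ...doc, [field]: value }))
  );
}

// Stand-in for the $match on Date.
function matchYear(docs, lower, upper) {
  return docs.filter(doc => doc.Date >= lower && doc.Date <= upper);
}

const records = [
  // Record taken from the question.
  { Record: 1, Link: "https://www.google.com", Date: 2017,
    Location: ["loc1", "loc2", "loc3"],
    PeoplePPL: ["ppl1", "ppl2", "ppl3"] },
  // Hypothetical second record that falls outside the year range.
  { Record: 2, Link: "https://www.example.com", Date: 2010,
    Location: ["loc1", "loc2", "loc3"],
    PeoplePPL: ["ppl1", "ppl2", "ppl3"] }
];

// Original order: unwind everything, then filter.
const unwoundThenMatched = unwind(unwind(records, "PeoplePPL"), "Location");
// Suggested order: filter first, then unwind only what survived.
const matchedThenUnwound =
  unwind(unwind(matchYear(records, 2015, 2020), "PeoplePPL"), "Location");

console.log(unwoundThenMatched.length);  // → 18 documents before the filter
console.log(matchedThenUnwound.length);  // → 9 documents in total
```

Both orders give the same nine documents in the end; the difference is how much work is done before the unwanted records are dropped.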

Aggregate Group by Date with Daylight Saving Offset

I'm trying to use mongo aggregation to group documents by the week of a timestamp on each document. All timestamps are stored in UTC and I need to calculate the week using the client's time, not UTC time.
I can provide and add the client's UTC offset as shown below, but this doesn't always work due to daylight saving time. The offset differs depending on the date, so adjusting all the timestamps with the offset of the current date won't do.
Does anyone know of a way to group by week that consistently accounts for daylight savings time?
db.collection.aggregate([
{ $group:
{ "_id":
{ "$week":
{ "$add": [ "$Timestamp", clientsUtcOffsetInMilliseconds ] }
},
"firstTimestamp":
{ "$min": "$Timestamp"}
}
}
]);
The basic concept here is to make your query "aware" of when "daylight savings" both "starts" and "ends" for the given query period and simply supply this test to $cond in order to determine which "offset" to use:
db.collection.aggregate([
{ "$group": {
"_id": {
"$week": {
"$add": [
"$Timestamp",
{ "$cond": [
{ "$and": [
{ "$gte": [
{ "$dayOfYear": "$Timestamp" },
daylightSavingsStartDay
]},
{ "$lt": [
{ "$dayOfYear": "$Timestamp" },
daylightSavingsEndDay
]}
]},
daylightSavingsOffset,
normalOffset
]}
]
}
},
"min": { "$min": "$Timestamp" }
}}
])
So you can make that a little more complex when covering several years, but it still is the basic principle. In the southern hemisphere daylight saving always spans a year boundary, so the condition becomes two "ranges": from the "start" day to the end of the year, and from the beginning of the year to the "end" day. That means an $or over those two ranges, each written with the "range" operators demonstrated.
If you have different values to apply, then detect when you "should" choose either and then apply using $cond.
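The offset selection, including the case where the daylight saving period wraps around the end of the year, can be sketched in plain JavaScript. This is an illustration only: `inDaylightSavings` and `pickOffset` are hypothetical helpers, the parameter names follow the placeholders used in the pipeline, and the day numbers and offsets in the usage lines are made-up values, not real time zone rules.

```javascript
// True when dayOfYear falls inside the daylight saving period.
// If startDay <= endDay the period lies within one year ($and of two tests);
// otherwise it wraps past the end of the year ($or of two tests).
function inDaylightSavings(dayOfYear, startDay, endDay) {
  if (startDay <= endDay) {
    return dayOfYear >= startDay && dayOfYear < endDay;
  }
  return dayOfYear >= startDay || dayOfYear < endDay;
}

// Mirrors the $cond in the pipeline: choose which offset to add.
function pickOffset(dayOfYear, startDay, endDay,
                    daylightSavingsOffset, normalOffset) {
  return inDaylightSavings(dayOfYear, startDay, endDay)
    ? daylightSavingsOffset
    : normalOffset;
}

const HOUR_MS = 60 * 60 * 1000;

// Illustrative northern-style period: days 80 to 300.
console.log(pickOffset(100, 80, 300, -4 * HOUR_MS, -5 * HOUR_MS) / HOUR_MS); // → -4
console.log(pickOffset(10, 80, 300, -4 * HOUR_MS, -5 * HOUR_MS) / HOUR_MS);  // → -5

// Illustrative southern-style period: day 280 through day 95 of the next year.
console.log(pickOffset(50, 280, 95, 11 * HOUR_MS, 10 * HOUR_MS) / HOUR_MS);  // → 11
console.log(pickOffset(150, 280, 95, 11 * HOUR_MS, 10 * HOUR_MS) / HOUR_MS); // → 10
```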

How to get multiple counts in one query for one field?

So what I'm trying to do is get multiple counts for one field, depending on a min/max value.
Collection holds something like
{name:'hi',price:100},
{name:'hi',price:134},
{name:'hi',price:500}
What I want to get is, for example, the count of items with a price between 100-200, 200-300, 300-400 and 400-500.
Is there a way to do this in MongoDB with one query? Is there a way to build the query without knowing the min and max?
You want .aggregate() here with the $cond ternary operator to determine the grouping _id within $group:
db.collection.aggregate([
{ "$match": {
"price": { "$gte": 100, "$lte": 500 }
}},
{ "$group": {
"_id": {
"$cond": [
{ "$lte": [ "$price", 200 ] },
"100-200",
{ "$cond": [
{ "$lte": [ "$price", 300 ] },
"200-300",
{ "$cond": [
{ "$lte": [ "$price", 400 ] },
"300-400",
"400-500"
]}
]}
]
},
"count": { "$sum": 1 }
}}
])
As a "ternary" if/then/else the $cond will evaluate the expression in the first argument "if" and then either return the second argument "then" where true or the third "else" where false.
The cascading logic means that you "nest" each ternary operation inside the false assertion till you reach an eventual result.
With the grouping _id value provided by conditions, you then just use $sum with an argument of 1 to "count" the matches in the group.
This gives you a response on the sample as:
{ "_id": "100-200", "count": 2 }
{ "_id": "400-500", "count": 1 }
The $match is making sure that all results are in the "ranges" that will be tested. If you exclude that then you likely want a last $cond "else" condition to return another value if the "price" was outside of an expected "range".
If you are looking to return "each" range, then you are better off inspecting the result and inserting a 0 count entry for every range that is not returned.
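Inserting the zero entries on the client side could look like the following. It is a minimal sketch in plain JavaScript: `fillMissingRanges` is a hypothetical helper, and `aggregated` stands for the array returned by the pipeline for the sample data.

```javascript
// Every range label the application expects to report on.
const expectedRanges = ["100-200", "200-300", "300-400", "400-500"];

// Result of the aggregation on the sample data, as shown above.
const aggregated = [
  { _id: "100-200", count: 2 },
  { _id: "400-500", count: 1 }
];

// Return one entry per expected range, using 0 where the
// aggregation returned no group for that range.
function fillMissingRanges(results, labels) {
  const counts = new Map(results.map(r => [r._id, r.count]));
  return labels.map(label => ({
    _id: label,
    count: counts.has(label) ? counts.get(label) : 0
  }));
}

console.log(fillMissingRanges(aggregated, expectedRanges));
// → 100-200: 2, 200-300: 0, 300-400: 0, 400-500: 1
```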

MongoDB Nested Array Intersection Query

and thank you in advance for your help.
I have a mongoDB database structured like this:
{
'_id' : objectID(...),
'userID' : id,
'movies' : [{
'movieID' : movieID,
'rating' : rating
}]
}
My question is:
I want to search for a specific user, for example the one with 'userID' : 3, and get all his movies. Then I want to get all the other users that have at least 15 movies with the same 'movieID'. From that group I want to select only the users that share those 15 movies and also have one extra 'movieID' that I choose.
I already tried aggregation, but failed, and if I do single queries, such as getting all the movies from a user and then cycling through every user's movies and comparing them, it takes a lot of time.
Any ideas?
Thank you
There are a couple of ways to do this using the aggregation framework.
Here is a simple set of data as an example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that cannot possibly be long enough to have n matches. Note that this $size operator in the query form is different from $size in the aggregation framework. There is no way, for example, to use it with an inequality operator such as $gt, as its purpose is to match exactly the requested "size". Hence this query form, which lists every possible size that is too small.
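Since the list of excluded sizes grows with n, it is easier to generate it than to write it by hand. A small sketch, assuming the array field is called "movies" as in the examples; `sizeExclusions` is a hypothetical helper name:

```javascript
// Build the clauses for "$and" that rule out arrays with fewer
// than n elements: one { $not: { $size: k } } clause for k = 1 .. n-1.
// Empty arrays need no clause here because the $in condition on
// "movies._id" already requires at least one element.
function sizeExclusions(n) {
  const clauses = [];
  for (let size = 1; size < n; size++) {
    clauses.push({ movies: { $not: { $size: size } } });
  }
  return clauses;
}

console.log(JSON.stringify(sizeExclusions(4)));
// → [{"movies":{"$not":{"$size":1}}},{"movies":{"$not":{"$size":2}}},{"movies":{"$not":{"$size":3}}}]
```

The returned array can then be used as the value of "$and" in the $match stage.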
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6 since all values are presented "side by side" ( and with lots of duplication ) you are doing a comparison of the two values to see if they are the same. Where that is true, this tells the $cond operator statement to return a value of 1 or 0 where the condition is false. This is directly passed back through $sum to total up the number of matching elements in the array to the required "set".
$project: This is where MongoDB 2.6 and greater differs: since you have pushed back an array of the "movies" _id values, you use $setIntersection to compare those arrays directly. As the result is an array containing the elements that are the same, it is then wrapped in a $size operator to determine how many elements were returned in that matching set.
$match: Is the final stage that has been implemented here which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.
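To check what the stages above should return for the sample data, the intersection count can be reproduced in plain JavaScript. This is a sketch of the logic only, not of how MongoDB executes it: `countShared` is a hypothetical helper, and the ObjectId values are written as plain strings.

```javascript
// Movie _id values of the reference "user" (the first sample document).
const wantedMovieIds = [1, 2, 3];

// The other two sample documents.
const otherUsers = [
  { _id: "538181738d6bd23253654691",
    movies: [{ _id: 1, rating: 5 }, { _id: 4, rating: 6 }, { _id: 2, rating: 7 }] },
  { _id: "538181738d6bd23253654692",
    movies: [{ _id: 2, rating: 5 }, { _id: 5, rating: 6 }, { _id: 6, rating: 7 }] }
];

// Size of the set intersection between a user's movie ids and the
// wanted ids, like $size wrapped around $setIntersection.
function countShared(movies, wantedIds) {
  const wanted = new Set(wantedIds);
  const shared = new Set();
  for (const movie of movies) {
    if (wanted.has(movie._id)) {
      shared.add(movie._id);
    }
  }
  return shared.size;
}

// Final $match: keep users sharing at least two movies.
const matchingUsers = otherUsers
  .filter(user => countShared(user.movies, wantedMovieIds) >= 2)
  .map(user => user._id);

console.log(matchingUsers); // → [ '538181738d6bd23253654691' ]
```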
Final
That is basically how you do it. Prior to 2.6 is a bit clunkier and will require a bit more memory due to the expansion that is done by duplicating each array member that is found by all of the possible values of the set, but it still is a valid way to do this.
All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".