Query multiple date ranges, return only specific key in MongoDB

In Mongo, I have documents that look like the following:
dateRange: [{
"price": "200",
"dateStart": "2014-01-01",
"dateEnd": "2014-01-30"
},
{
"price": "220",
"dateStart": "2014-02-01",
"dateEnd": "2014-02-15"
}]
Nice and simple, right? Just dates and prices. Now, the tricky part is: how would I go about creating a query to find the dateRange that fits 2014-01-12, and then JUST return the price after it's found, instead of the entire array of dateRanges?
These dateRanges can get quite large, and I'm trying to minimize the amount of data returned (if this is possible at all with Mongo). Note that I can change the date format if required; I was just using the above for example purposes.
Any help is appreciated, thanks!

You want to use the $elemMatch operator, which is only valid in versions 2.2 and upward. You will also need to make sure you use multikey indexes.
Edit: to be clear, you will also have to use the $elemMatch find operator, as pointed out in the comment below.
This being said, I agree with the gist of the comment by mnemosyn. It would be better to have each element of the array represented as a single document.
Here is a quick example of $elemMatch to demonstrate the projection. Simply add $elemMatch to the find portion of the query as well.
> db.test.save ( {
_id: 1,
zipcode: 63109,
students: [
{ name: "john", school: 102, age: 10 },
{ name: "jess", school: 102, age: 11 },
{ name: "jeff", school: 108, age: 15 }
]
} );
> db.test.find( { zipcode: 63109 }, { students: { $elemMatch: { school: 102 } } } ).pretty();
{
"_id" : 1,
"students" : [
{
"name" : "john",
"school" : 102,
"age" : 10
}
]
}
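Applied to the question's documents, a sketch might look like this (the collection name rooms is an assumption, and the string comparison only works because the zero-padded date format sorts lexicographically):
// Hypothetical collection name; adjust to your schema.
// The query $elemMatch finds documents containing a matching range;
// the projection $elemMatch returns only that matching element.
db.rooms.find(
    { dateRange: { $elemMatch: { dateStart: { $lte: "2014-01-12" },
                                 dateEnd:   { $gte: "2014-01-12" } } } },
    { _id: 0, dateRange: { $elemMatch: { dateStart: { $lte: "2014-01-12" },
                                         dateEnd:   { $gte: "2014-01-12" } } } }
)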

Well, the problem with that schema is that it uses large embedded arrays. This can be quite inefficient, because a MongoDB query will always find a document, not a subset of an embedded object. Even if you're using a projection, MongoDB has to read the entire object internally, so if the array becomes huge, say 100k entries, that will grind things to a halt.
Why not simply separate these array elements into documents, e.g.
{
price : 200,
productId : ObjectId("foo"), // or whatever the price refers to
dateStart : "2014-01-01",
dateEnd : "2013-01-30"
}
This way, mongodb doesn't need to pull the entire object with all prices, but only the prices that match your date range. This will minimize the amount of data transferred. You can then also use the query projection to only return the price, i.e. db.collection.find({ criteria }, {"price" : 1, "_id" : 0}).
Of course, the number of objects will increase dramatically, but efficient indexing will solve that problem. The only inefficiency induced is the duplication of the productId, which is cheaper than dealing with huge embedded arrays.
P.S: I'd suggest using actual dates (ISODate) instead of strings, even if their format is sortable.
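Putting both points together, a minimal sketch of the resulting query (the collection name prices and the index are assumptions):
// Assumes one document per price range, ISODate values as suggested above,
// and an index such as { productId: 1, dateStart: 1, dateEnd: 1 }
var target = ISODate("2014-01-12");
db.prices.find(
    { productId: someProductId, // the ObjectId the price refers to
      dateStart: { $lte: target },
      dateEnd:   { $gte: target } },
    { price: 1, _id: 0 }
)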

Related

MongoDB - $push accumulator is slowing query down

To develop a web application, I use MongoDB as the back end, and I need to retrieve data from it. On a particular page I need to retrieve the price history for specific brands. In my MongoDB collection, here is how a document/product is saved:
{
    brand: "exampleBrand",
    prices: [
        { date: "2022-03-08", price: 1900 },
        { date: "2022-03-09", price: 1910 }
    ]
}
My goal is then to retrieve dates and prices for a specific brand in the following format:
[
    { date: "2022-03-08", prices: [price_product1, price_product2, ...] },
    { date: "2022-03-09", prices: [price_product1, price_product3, ...] }
]
In order to do that I have designed the following query :
db.Prices.aggregate([
    { $match: { brand: "exampleBrand" } },
    { $project: { _id: 0, prices: 1 } },
    { $unwind: "$prices" },
    { $group: {
        _id: "$prices.date",
        prix: { $push: "$prices.price" }
    } }
]);
Once I have these results I can go on with different calculations, etc., to display on my page. However, there are approximately 90,000 documents, each with an average of 30 prices and dates. Thus, the group stage of the aggregation pipeline is taking a long time.
I have tried different indexes on "prices", "prices.date", and "brand, prices", but none of them seems to speed up the query. I have also tried twisting and changing the query but couldn't find a more efficient way to get my results. Would anyone have an idea of how to achieve this?
Thank you,
For faster querying under these conditions, I think the only way is to use a caching mechanism like Redis or Memcached, because the query, especially the work on the array, costs a lot of I/O and processing in the aggregation.
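A minimal sketch of that caching approach, assuming a Node.js tier with the official mongodb and redis drivers (the database name, connection string, and key scheme are hypothetical):
// Cache the per-brand aggregation result in Redis with a TTL;
// invalidate or let it expire when prices change. In a real app the
// clients would be created once and reused, not opened per call.
const { MongoClient } = require("mongodb");
const { createClient } = require("redis");

async function pricesByBrand(brand) {
    const redis = createClient();
    await redis.connect();
    const cacheKey = "prices:" + brand;            // hypothetical key scheme
    const cached = await redis.get(cacheKey);
    if (cached) return JSON.parse(cached);

    const mongo = await MongoClient.connect("mongodb://localhost:27017");
    const result = await mongo.db("shop").collection("Prices").aggregate([
        { $match: { brand: brand } },
        { $project: { _id: 0, prices: 1 } },
        { $unwind: "$prices" },
        { $group: { _id: "$prices.date", prix: { $push: "$prices.price" } } }
    ]).toArray();

    await redis.set(cacheKey, JSON.stringify(result), { EX: 3600 }); // 1 hour TTL
    return result;
}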
P.S. I doubt it is feasible, but if you are somehow able to change your data structure so that it is flat, it would be faster, though not caching-method faster.
example:
{
    brand: "exampleBrand",
    price1: 1900,
    date1: "2022-03-08",
    price2: 2100,
    date2: "2022-03-29"
}

What is the best way to use a collection as a round robin in MongoDB

I have a collection named items with three documents.
{
_id: 1,
item: "Pencil"
}
{
_id: 1,
item: "Pen"
}
{
_id: 1,
item: "Sharpner"
}
How could I query to get the documents in round-robin fashion?
Consider that I get multiple user requests at the same time,
so one request should get Pencil, another Pen, and another Sharpener,
then start again from the first one.
If changing the schema is an option, I am ready for that as well.
I think I found a way to do this without changing the schema. It is based on skip() and limit(). Moreover, you can ask for the documents' internal ($natural) ordering, but as the guide says you should not rely on this, especially because you lose performance since indexes are bypassed:
The $natural parameter returns items according to their natural order
within the database. This ordering is an internal implementation
feature, and you should not rely on any particular structure within
it.
Anyway, this is the query:
db.getCollection('YourCollection').find().skip(counter).limit(1)
Where counter stores the current index for your documents.
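A sketch of how counter could be maintained atomically on the database side (the counters collection and field names are assumptions):
// Hypothetical counter document: { _id: "items", seq: 0 }
// Atomically fetch-and-increment, then wrap around the collection size.
var total = db.getCollection('YourCollection').count();
var c = db.counters.findOneAndUpdate(
    { _id: "items" },
    { $inc: { seq: 1 } },
    { upsert: true, returnNewDocument: true }
);
db.getCollection('YourCollection').find().skip(c.seq % total).limit(1)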
A few things to start:
_id has to be unique across a collection, especially when the collection is part of a replica set.
This is a very stateful requirement and would not work well with a distributed set of services, for example.
With that said, assuming you really just want to iterate from the database, I would use cursors to accomplish this. For the record, this does a collection scan and is very inefficient.
var myCursor = db.items.find().sort({_id:1});
while (myCursor.hasNext()) {
printjson(myCursor.next());
}
My suggestion is that you should pull all results from the database at once and do your iteration in the application tier.
var myCursor = db.items.find().sort({_id:1});
var documentArray = myCursor.toArray();
documentArray.forEach(doSomething);
If this is about distribution you may consider fetching random documents instead of round-robin via aggregation/$sample:
db.collection.aggregate([
{
"$sample": {
"size": 1
}
}
])
Or there are options to randomize via $rand ...
Use findOneAndUpdate after restructuring the data objects:
db.counter.findOneAndUpdate( {}, pipeline)
{
"_id" : ObjectId("624317a681e72a1cfd7f2b7e"),
"values" : [
"Pencil",
"Pen",
"Sharpener"
],
"selected" : "Pencil",
"counter" : 1
}
db.counter.findOneAndUpdate( {}, pipeline)
{
"_id" : ObjectId("624317a681e72a1cfd7f2b7e"),
"values" : [
"Pencil",
"Pen",
"Sharpener"
],
"selected" : "Pen",
"counter" : 2
}
where the data object is now:
{
"_id" : ObjectId("6242fe3bc1551d0f3562bcb2"),
"values" : [
"Pencil",
"Pen",
"Sharpener"
],
"selected" : "Pencil",
"counter" : 1
}
and the pipeline is:
[{$project: {
values: 1,
selected: {
$arrayElemAt: [
'$values',
'$counter'
]
},
counter: {
$mod: [
{
$add: [
'$counter',
1
]
},
{
$size: '$values'
}
]
}
}}]
This has some merits:
Firstly, using findOneAndUpdate means that moving the pointer to the
next item in the list and reading the object happen at once.
Secondly, by using {$size: "$values"}, adding a value to the list
doesn't change the logic.
And instead of a string, an object could be used.
Problems:
This method would be unwieldy with more than tens of entries.
It is hard to prove that this method works as advertised, so there is an accompanying Kotlin project on GitHub. The project uses coroutines, so it calls find/update asynchronously.
The alternative (assuming 50K items and not 3):
Set up a simple counter {counter: 0} and update it as follows:
db.counter.findOneAndUpdate({},
[{$project: {
counter: {
$mod: [
{
$add: [
'$counter',
1
]
},
50000
]
}
}}])
Then use a simple select query to find the right document.
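A sketch of that follow-up query, assuming each item document carries a sequential field (the field name seq is hypothetical, and pipeline-style updates require MongoDB 4.2+):
// Hypothetical item shape: { seq: 42, item: "Pencil" }
var c = db.counter.findOneAndUpdate({},
    [{ $project: { counter: { $mod: [{ $add: ['$counter', 1] }, 50000] } } }],
    { returnNewDocument: true }
);
db.items.findOne({ seq: c.counter });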
I've updated the GitHub project to include this example.

MongoDB return latest full document for each id (Full document Object containing all fields like sub document arrays etc) [duplicate]

I want to get the last document for each station, with all the other fields:
{
"_id" : ObjectId("535f5d074f075c37fff4cc74"),
"station" : "OR",
"t" : 86,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
"_id" : ObjectId("535f5d114f075c37fff4cc75"),
"station" : "OR",
"t" : 82,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
"_id" : ObjectId("535f5d364f075c37fff4cc76"),
"station" : "WA",
"t" : 79,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
I need to have t and station for the latest dt per station.
With the aggregation framework :
db.temperature.aggregate([
    { $sort: { "dt": 1 } },
    { $group: { "_id": "$station", result: { $last: "$dt" }, t: { $last: "$t" } } }
])
returns
{
"result" : [
{
"_id" : "WA",
"result" : ISODate("2014-04-29T08:02:57.165Z"),
"t" : 79
},
{
"_id" : "OR",
"result" : ISODate("2014-04-29T08:02:57.165Z"),
"t" : 82
}
],
"ok" : 1
}
Is this the most efficient way to do that?
Thanks
To directly answer your question, yes it is the most efficient way. But I do think we need to clarify why this is so.
As was suggested in the alternatives, the one thing people reach for is "sorting" your results before passing them to a $group stage, keyed on the "timestamp" value, so you would want to make sure that everything is in "timestamp" order; hence the form:
db.temperature.aggregate([
{ "$sort": { "station": 1, "dt": -1 } },
{ "$group": {
"_id": "$station",
"result": { "$first":"$dt"}, "t": {"$first":"$t"}
}}
])
And as stated you will of course want an index to reflect that in order to make the sort efficient:
However, and this is the real point: what seems to have been overlooked by others (if not by yourself) is that all of this data is likely being inserted in time order already, in that each reading is recorded as it is added.
So the beauty of this is that the _id field (with a default ObjectId) is already in "timestamp" order, as it does itself actually contain a time value, and this makes the statement possible:
db.temperature.aggregate([
{ "$group": {
"_id": "$station",
"result": { "$last":"$dt"}, "t": {"$last":"$t"}
}}
])
And it is faster. Why? Well, you don't need to select an index (additional code to invoke), and you also don't need to "load" the index in addition to the document.
We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could equally well "range" query on the _id values to select between two dates.
The only real thing to say here is that in "real world" usage, it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation, as opposed to getting the "first" and "last" _id values to define a "range", or something similar in your actual usage.
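A sketch of such an _id range (the helper name is made up; an ObjectId's leading four bytes are a Unix timestamp, so eight hex digits of seconds padded with sixteen zeros bound the range):
// Construct ObjectIds whose embedded timestamps bound the date range
function objectIdFromDate(d) {
    return ObjectId(Math.floor(d.getTime() / 1000).toString(16) + "0000000000000000");
}
db.temperature.aggregate([
    { "$match": { "_id": {
        "$gte": objectIdFromDate(new Date("2014-04-01")),
        "$lt":  objectIdFromDate(new Date("2014-05-01"))
    } } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    } }
])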
So where is the proof of this? Well it is fairly easy to reproduce, so I just did so by generating some sample data:
var stations = [
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
"NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
"OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
"VA", "WA", "WV", "WI", "WY"
];
for ( i=0; i<200000; i++ ) {
var station = stations[Math.floor(Math.random()*stations.length)];
var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) +50;
dt = new Date();
db.temperatures.insert({
station: station,
t: t,
dt: dt
});
}
On my hardware (8GB laptop with spinny disk, which is not stellar, but certainly adequate) running each form of the statement clearly shows a notable pause with the version using an index and a sort ( same keys on index as the sort statement). It is only a minor pause, but the difference is significant enough to notice.
Even looking at the explain output (version 2.6 and up, though it is actually there in 2.4.9, just undocumented) you can see the difference: though the $sort is optimized out due to the presence of an index, the time taken appears to go to index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.
Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.
So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real world mileage may vary on whether this is practical for you or not and it might simply end up being more convenient to implement the index and sorting on the date.
But if you were happy with using _id ranges or greater than the "last" _id in your query, then perhaps one tweak in order to get the values along with your results so you can in fact store and use that information in successive queries:
db.temperature.aggregate([
// Get documents "greater than" the "highest" _id value found last time
{ "$match": {
"_id": { "$gt": ObjectId("536076603e70a99790b7845d") }
}},
// Do the grouping with addition of the returned field
{ "$group": {
"_id": "$station",
"result": { "$last":"$dt"},
"t": {"$last":"$t"},
"lastDoc": { "$last": "$_id" }
}}
])
And if you were actually "following on" the results like that then you can determine the maximum value of ObjectId from your results and use it in the next query.
Anyhow, have fun playing with that, but again Yes, in this case that query is the fastest way.
An index is all you really need:
db.temperature.ensureIndex({ 'station': 1, 'dt': 1 })
db.temperature.distinct('station').forEach(function (s) {
    printjson(db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1).next());
});
of course using whatever syntax is actually valid for your language.
Edit: You are correct that a loop like this incurs a round-trip per station, and it's great for a few stations, and not so good for 1000. You do still want the compound index on station+dt, though, and to take advantage of a descending sort:
db.temperature.aggregate([
{ $sort: { station: 1, dt: -1 } },
{ $group: { _id: "$station", result: {$first:"$dt"}, t: {$first:"$t"} } }
])
As for the aggregation query you've posted, I'd make certain that you have an index on dt:
db.temperature.ensureIndex({'dt': 1 })
This will make certain that the $sort at the beginning of the aggregation pipeline is as efficient as possible.
As to whether or not this is the most efficient way to get this data, vs. a query in a loop, will likely be a function of how many data points you have. In the beginning, with "thousands of stations" and perhaps hundreds of thousands of data points I'd think the aggregation approach will be faster.
However, as you add more and more data an issue is that the aggregation query will continue to touch all the documents. This will get increasingly expensive as you scale up to millions or more documents. One approach for that case would be to add a $limit right after the $sort to limit the total number of documents being considered. That's a bit hacky and inexact but it would help to limit the total number of documents that need to be accessed.
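For example, a sketch of that cap (the $limit value is a tuning guess, not an exact figure):
db.temperature.aggregate([
    { $sort: { dt: -1 } },
    { $limit: 100000 },   // hypothetical cap on documents considered
    { $group: { _id: "$station", result: { $first: "$dt" }, t: { $first: "$t" } } }
])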

MongoDB insert or add in embedded document

This is my final document structure. I am on mongo 3.0.
{
_id: "1",
stockA: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ],
stockB: [ { size: "S", qty: 27 }, { size: "M", qty: 58 } ]
}
I am randomly adding elements to the stockA or stockB arrays.
I would like to come up with a query that satisfies the following rules:
If the item does not exist, create the whole document and insert the element into the stockA or stockB array.
If the item exists but the stockA or stockB array is missing, create the array and add the element to it.
If the item exists and the array exists, just append the element to the array.
Question: is it possible to achieve all that in one query? My main requirement is that the insert has to be extremely fast and scalable.
If yes, could you please help me come up with the required query?
If no, could you please guide me on how to achieve these requirements in the fastest and cleanest way?
Thanks for any help!
The operator you are looking for is $addToSet or $push in combination with upsert:true.
Both operators can take multiple key/value pairs and then operate separately on each key. Both will create the array when it doesn't exist yet, and in either case will add the value.
The difference is that $addToSet will first check if an identical element already exists in the array. When performance is an issue, keep in mind that MongoDB must perform a linear search to do this. So for very large arrays, $addToSet can become quite slow.
The upsert:true option to the update function will result in the document being created when it isn't found.
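A minimal sketch combining $addToSet with upsert:true, using the document shape from the question (the values are illustrative):
// $addToSet variant: skips an element if an identical one is already present
db.collection.update(
    { "_id": 1 },
    { "$addToSet": {
        "stockA": { "size": "S", "qty": 25 },
        "stockB": { "size": "S", "qty": 27 }
    } },
    { "upsert": true }
)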
You can use the update method with the upsert option which will create a new document when no document matches the query criteria.
db.collection.update({ "_id": 1 },
{
"$push" : {
"stockB" : {
"size" : "M",
"qty" : 51
},
"stockA" : {
"size" : "M",
"qty" : 21
}
}
},
{ "upsert": true }
)

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
    'id': ObjectId("50ad8d451d41c8fc58000003"),
    'name': 'Sample Log 1',
    'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1', 'TAG-2'],
        'group_y': ['XYZ']
    }
},
{
    'id': ObjectId("50ad8d451d41c8fc58000004"),
    'name': 'Sample Log 2',
    'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1'],
        'group_y': ['XYZ']
    }
}
> db.cases.findOne()
{
    'id': ObjectId("50ad8d451d41c8fc58000099"),
    'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group pipelines, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
    {$match: {...}}, // some match filters here
    {$project: {tag: '$tag_doc.group_x', case: '$case_id', latest: '$uploaded_at'}},
    {$unwind: '$tag'},
    {$group: {_id: {tag: '$tag', case: '$case'}, latest: {$max: '$latest'}}},
    {$group: {_id: '$_id.tag', total: {$sum: 1}}}
)
As I mentioned, what I want can be done with multiple $group pipelines, but this proves to be costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution so I am thinking if the document structure itself is not optimized for my use-case. Do I have to update the fields to support what I want to achieve? Suggestions very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL? except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use-case I tried to use a simple analogy but this proves to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. But it's not possible with $match, only with one $group pipeline. The trick is to use a multi key index with the correct sorting:
{ user_id: 1, address: "xyz",  date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id & address and I want the message with the latest date, we need to create a key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregate without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order you can add a sort pipeline, but then you can't use it with shards, because everything is transferred to mongos and grouping is done there (you will also run into memory limit problems).
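Concretely, creating that index might look like this sketch (the collection name is taken from the aggregate below):
db.user_messages.ensureIndex({ user_id: 1, address: 1, date_sent: -1 })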
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
... {'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
... { $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
... {upsert:true}
)
Hmmm, there is no good way of doing this optimally in such a manner that you only need to pick out the latest for each author; instead, you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said this is not optimal; however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
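A sketch of that write-time upsert, using the pre-aggregated shape above (the collection name latest_posts and the post variable are assumptions):
// On each new post, upsert one row per (author, tag) combination
post.tag_doc.tags.forEach(function (tag) {
    db.latest_posts.update(
        { author: post.author_id, tag: tag },
        { $set: { created_at: post.uploaded_at, post_id: post._id } },
        { upsert: true }
    );
});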
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$Name" },
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields you would add them along with (or instead of) "Name" and "tag_doc".