Get last record for several items at once with mongo

In my mongo database, I have basically 2 collections:
pupils
{ _id: ObjectId("539ab7ffefbb93120c9697f7"), firstname: 'Arnold', lastname: 'Smith' }
{ _id: ObjectId("539ab7ffefbb93120c5473c3"), firstname: 'Steven', lastname: 'Jens' }
marks
{ date: '2014-06-12', value: 12, pupilID: ObjectId("539ab7ffefbb93120c9697f7") }
{ date: '2014-06-05', value: 9, pupilID: ObjectId("539ab7ffefbb93120c9697f7") }
{ date: '2014-05-10', value: 17, pupilID: ObjectId("539ab7ffefbb93120c9697f7") }
{ date: '2014-05-10', value: 7, pupilID: ObjectId("539ab7ffefbb93120c5473c3") }
Is there a way with the mongo shell to get the last mark of each pupil without having to manually loop through the list of pupils and get the last mark for each one?
Currently I loop through each pupil and perform a:
db.marks.find({pupilID: pupilID}).sort({_id: -1}).limit(1)
But I'm quite concerned about performance if the marks collection contains a high number of items.

Well, your dates are not the best example here as they are strings. You should convert them to proper "Date" types, but at least they sort lexically.
This is not the "join" you seem to be implicitly looking for, but you can get the $last mark for each pupil from your "marks" collection, which should go some way toward your result:
db.marks.aggregate([
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": "$pupilID",
        "date": { "$last": "$date" },
        "value": { "$last": "$value" }
    }}
])
And that will give you the last mark "value" by date for each "pupilID". The joining of data is up to you, but this is better than looping over whole collections or otherwise firing off one query per "pupil".
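If you then need the pupil details as well, here is a minimal sketch of the client-side join in the shell, assuming pupilID holds the pupil's ObjectId:

// Get the last mark per pupil, then fetch the matching pupils in a
// single $in query instead of one query per pupil.
var lastMarks = db.marks.aggregate([
    { "$sort": { "date": 1 } },
    { "$group": {
        "_id": "$pupilID",
        "date": { "$last": "$date" },
        "value": { "$last": "$value" }
    }}
]).toArray();

var pupils = db.pupils.find({
    "_id": { "$in": lastMarks.map(function(m) { return m._id; }) }
}).toArray();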

Bulk remove a special letter in collection

Let's say I have a Mongo database with a collection, let's call it products. In this collection I want to remove all special characters, say all dots, from all entries in a certain field, say price.
Also, how would I replace, for example, the info entries of all of my objects?
How would I do this through the mongo shell?
Example:
{
    _id: '123324erwerew',
    name: 'moisture cream',
    price: 30.00,
    info: 'Good Cream'
}
{
    _id: '343324erwerew',
    name: 'moisture cream two',
    price: 40.00,
    info: 'Good Cream also'
}
Let's say info in both of them should be "best cream ever" and the dots should be gone from both prices.
If your prices were like 30.11 and 40.11, the following pipeline update command would make the prices "3011" and "4011", as well as set info to "best cream ever":
db.products.updateMany({}, [
    {
        $set: {
            price: {
                $reduce: {
                    input: { $split: [{ $toString: "$price" }, "."] },
                    initialValue: "",
                    in: { $concat: ["$$value", "$$this"] }
                }
            }
        }
    },
    {
        $set: {
            info: "best cream ever"
        }
    }
])
Explanation of the price update:
the value of price is converted to a string
the resulting string is split on the . delimiter
the resulting parts are then concatenated back into a single string using $reduce
In MongoDB v4.4 this is much simplified with the $replaceOne operator; see the sketch below.
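A minimal sketch of the v4.4 approach, under the same assumptions (a products collection, prices with a single decimal point):

// $replaceOne replaces the first occurrence of "." only, which is enough
// here since a price contains a single decimal point. Requires MongoDB 4.4+.
db.products.updateMany({}, [
    {
        $set: {
            price: { $replaceOne: { input: { $toString: "$price" }, find: ".", replacement: "" } },
            info: "best cream ever"
        }
    }
])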

Speed up aggregation on large collection

I currently have a database with about 270 000 000 documents. They look like this:
[{
'location': 'Berlin',
'product': 4531,
'createdAt': ISODate(...),
'value': 3523,
'minOffer': 3215,
'quantity': 7812
},{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}]
The database currently holds a bit over one month of data and has ~170 locations (in the EU and US) with ~8000 products. These documents represent timesteps, so there are ~12-16 entries per day per product per location (at most 1 per hour).
My goal is to retrieve all timesteps of a product in a given location for the last 7 days. For a single location this query works reasonably fast (150ms) with the index { product: 1, location: 1, createdAt: -1 }.
However, I also need these timesteps not just for a single location, but an entire region (so about 85 locations). I'm currently doing that with this aggregation, which groups all the entries per hour and averages the desired values:
this.db.collection('...').aggregate([
    { $match: {
        location: { $in: [array of ~85 locations] },
        product: productId,
        createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
    } },
    { $group: {
        _id: {
            $toDate: {
                $concat: [
                    { $toString: { $year: '$createdAt' } },
                    '-',
                    { $toString: { $month: '$createdAt' } },
                    '-',
                    { $toString: { $dayOfMonth: '$createdAt' } },
                    ' ',
                    { $toString: { $hour: '$createdAt' } },
                    ':00'
                ]
            }
        },
        value: { $avg: '$value' },
        minOffer: { $avg: '$minOffer' },
        quantity: { $avg: '$quantity' }
    } }
]).sort({ _id: 1 }).toArray()
However, this is really, really slow, even with the index { product: 1, createdAt: -1, location: 1 } (~40 secs). Is there any way to speed up this aggregation so it takes a few seconds at most? Is this even possible, or should I think about using something else?
I've thought about saving these aggregations in another database and just retrieving them and aggregating the rest; however, that is really awkward for the first users on the site, who would have to sit through the 40-second wait.
These are some ideas which can benefit querying and performance. Whether they will all work together is a matter of trials and testing. Also, note that changing the way data is stored and adding new indexes means there will be changes to the application, i.e., to how data is captured, and the other queries on the same data need to be carefully verified (that they are not affected in a wrong way).
(A) Storing a Day's Details in a Document:
Store (embed) a day's data within the same document as an array of sub-documents. Each sub-document represents an hour's entry.
From:
{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}
to:
{
location: 'London',
product: 1231,
createdAt: ISODate(...),
details: [ { value: 53523, minOffer: 44215, quantity: 2812 }, ... ]
}
This means about ten entries per document. Adding data for an entry becomes a push into the details array, instead of inserting a new document as in the present application (a sketch follows). In case the hour's info (time) is required, it can also be stored as part of the details sub-document; that will entirely depend upon your application's needs.
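A minimal sketch of such a write, where dayStart is a hypothetical variable holding the day's boundary computed by the application:

// Upsert the day's document for this product/location and push the
// hour's entry into its details array.
db.collection('...').updateOne(
    { location: 'London', product: 1231, createdAt: dayStart },
    { $push: { details: { value: 53523, minOffer: 44215, quantity: 2812 } } },
    { upsert: true }
)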
The benefits of this design:
The number of documents to maintain and query will be reduced (roughly ten documents per product per day collapse into one).
In the query, the group stage goes away; it becomes just a project stage. Note that $project supports the accumulators $avg and $sum.
The following stage will create the sums and averages for the day (i.e., per document):
{
    $project: { value: { $avg: '$details.value' }, minOffer: { $avg: '$details.minOffer' }, quantity: { $avg: '$details.quantity' } }
}
Note that the increase in document size is not much, given the amount of detail being stored per day.
(B) Querying by Region:
At present, matching multiple locations (or a region) uses this query filter: { location: { $in: [array of ~85 locations] } }. This filter effectively says: location: location-1, or location: location-2, ..., or location: location-85. Adding a new field, region, reduces this to a single-value match.
The query by region will change to:
{
$match: {
region: regionId,
product: productId,
createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
}
}
The regionId variable is to be supplied to match with the region field.
Note that both queries, "by location" and "by region", will benefit from the above two considerations, A and B; a combined sketch follows.
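A minimal sketch combining (A) and (B), under the proposed schema (one document per product/location/day with a details array):

// Match by the new region field, then average each document's (day's)
// details array via $project; no $group stage is needed any more.
db.collection('...').aggregate([
    { $match: {
        region: regionId,
        product: productId,
        createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
    } },
    { $project: {
        createdAt: 1,
        value: { $avg: '$details.value' },
        minOffer: { $avg: '$details.minOffer' },
        quantity: { $avg: '$details.quantity' }
    } },
    { $sort: { createdAt: 1 } }
])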
(C) Indexing Considerations:
The present index: { product: 1, location: 1, createdAt: -1 }.
Taking the new region field into consideration, new indexes will be needed. The query with region cannot benefit without an index on the region field, so a second index will be needed: a compound index to suit the query (a possible shape is sketched below). Creating an index with the region field means additional overhead on write operations; there are also memory and storage considerations.
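A minimal sketch of such an index, to be verified with explain as the notes below stress:

// Hypothetical compound index supporting the "by region" query shape:
// equality on region and product, range on createdAt.
db.collection('...').createIndex({ region: 1, product: 1, createdAt: -1 })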
NOTES:
After adding the index, both queries ("by location" and "by region") need to be verified using explain to confirm they are using their respective indexes. This will require some testing; a trial-and-error process.
Again, adding new data, storing data in a different format, and adding new indexes require considering the following:
Careful testing and verifying that the other existing queries perform as usual.
The change in data capture needs.
Testing the new queries and verifying if the new design performs as expected.
Honestly, your aggregation is pretty much as optimized as it can get, especially if you have { product: 1, createdAt: -1, location: 1 } as an index as you stated.
I'm not exactly sure how your entire product is built, but the best solution in my opinion is to have another collection containing just the "relevant" documents from the past week.
Then you could query that collection with ease. This is quite easy to do in Mongo using a TTL index (sketched below).
If that is not an option, you could add a temporary field to the "relevant" documents and query on that, making it somewhat faster to retrieve them. But maintaining this field will require a process running every X time, which could make your results not 100% accurate depending on when you decide to run it.
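A minimal sketch of the TTL approach, assuming a hypothetical recent collection that the application writes to alongside the main one:

// Documents expire automatically once createdAt is more than 7 days old,
// so queries against this collection only ever scan one week of data.
db.recent.createIndex({ createdAt: 1 }, { expireAfterSeconds: 7 * 24 * 60 * 60 })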

Can I use populate before aggregate in mongoose?

I have two models. One is user:
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score, recorded several times every day for all users:
ScoreSchema = new Schema({
userID: {type: String, ref: 'User'},
score: Number,
created_date: Date,
....
})
I would like to do some queries/calculations on the scores for users meeting a specific requirement; say, I would like to calculate the average score, day by day, for all users older than 20.
My thought was to first call populate on Scores to populate the users' ages, and then aggregate after that.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No, you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries ( or more accurately an $in query ) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline operation. It is also probably best to handle this from the User collection in this case, in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foriegnField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
var userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userId": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So getting the list of valid users from the other collection to the client, and then feeding that to the other collection in a query, is the only way to get this to happen in earlier releases.

List all existing values of a property?

Assume I have a Student collection:
{
    name: "ABC",
    age: 10,
    address: {
        city: "CITY1",
        state: "STATE"
    }
}
{
    name: "DEF",
    age: 11,
    address: {
        city: "CITY2",
        state: "STATE"
    }
}
{
    name: "ABC",
    age: 12,
    address: {
        city: "CITY1",
        state: "STATE"
    }
}
Can I get the list of all unique city values from the collection? For example, with the above 3 documents, I would like to get the list {"CITY1", "CITY2"}.
I'm just getting started with MongoDB, coming from relational databases, so this is a little confusing for me: in SQL I would need a separate Address table and could just use SELECT DISTINCT to get what I want.
MongoDB has a similar db.collection.distinct() command.
To access elements in the address subdocument you need to use dot notation, so the complete query would be:
db.Student.distinct("address.city")
Some helpful documentation links to help you make the translation from SQL queries:
SQL to MongoDB Mapping Chart
SQL to Aggregation Mapping Chart
Just for notes, there is already distinct as mentioned, but for a more conventional response shape, use aggregate (address is a sub-document rather than an array, so no $unwind stage is needed):
db.Student.aggregate([
    { "$group": { "_id": "$address.city" } },
    { "$project": { "_id": 0, "city": "$_id" } }
])
Long winded compared to distinct, but it depends on what your eyes want.

Query multiple date ranges, return only specific key in MongoDB

In Mongo, I have documents that look like the following:
dateRange: [{
"price": "200",
"dateStart": "2014-01-01",
"dateEnd": "2014-01-30"
},
{
"price": "220",
"dateStart": "2014-02-01",
"dateEnd": "2014-02-15"
}]
Nice and simple, right? Just dates and prices. Now, the tricky part I'm facing is: how would I go about creating a query to find the dateRange that fits 2014-01-12, and then return JUST the price once it's found, instead of the entire array of dateRanges?
These dateRanges can get quite large, and I'm trying to minimize the amount of data returned (if this is possible at all with Mongo). Note that I can change the date format if required; I was just using the above for example purposes.
Any help is appreciated, thanks!
You want to use the $elemMatch operator, which is only valid in versions 2.2 upward. You will also need to make sure you use multikey indexes.
Edit: to be clear, you will also have to use the $elemMatch find operator, as pointed out in the comment below.
That being said, I agree with the gist of the comment by mnemosyn. It would be better to have each element of the array represented as a single document.
A quick example of $elemMatch to demonstrate the projection; simply add $elemMatch to the find as well.
> db.test.save ( {
_id: 1,
zipcode: 63109,
students: [
{ name: "john", school: 102, age: 10 },
{ name: "jess", school: 102, age: 11 },
{ name: "jeff", school: 108, age: 15 }
]
} );
> db.test.find( { zipcode: 63109 }, { students: { $elemMatch: { school: 102 } } } ).pretty()
{
"_id" : 1,
"students" : [
{
"name" : "john",
"school" : 102,
"age" : 10
}
]
}
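Applied to the question's documents, a sketch might look like this (assuming a hypothetical collection name prices; the YYYY-MM-DD strings compare correctly since they sort lexically):

// $elemMatch in the query selects documents containing a matching range;
// $elemMatch in the projection returns only that matching array element.
db.prices.find(
    { dateRange: { $elemMatch: {
        dateStart: { $lte: "2014-01-12" },
        dateEnd: { $gte: "2014-01-12" }
    } } },
    { _id: 0, dateRange: { $elemMatch: {
        dateStart: { $lte: "2014-01-12" },
        dateEnd: { $gte: "2014-01-12" }
    } } }
)

Note this still returns the whole matching element (price plus dates) rather than the bare price; trimming further is where the single-document design in the next answer helps.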
Well, the problem with that schema is that it uses large embedded arrays. This can be quite inefficient, because a MongoDB query will always find a document, not a subset of an embedded object. Even if you're using a projection, MongoDB will have to read the entire object internally, so if the array becomes huge, say 100k entries, that will slow things down to a halt.
Why not simply separate these array elements into documents, e.g.
{
price : 200,
productId : ObjectId("foo"), // or whatever the price refers to
dateStart : "2014-01-01",
dateEnd : "2013-01-30"
}
This way, mongodb doesn't need to pull the entire object with all prices, but only the prices that match your date range. This will minimize the amount of data transferred. You can then also use the query projection to only return the price, i.e. db.collection.find({ criteria }, {"price" : 1, "_id" : 0}).
Of course, the number of objects will increase dramatically, but efficient indexing will solve that problem (a sketch follows). The only inefficiency introduced is the duplication of the productId, which is cheaper than dealing with huge embedded arrays.
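A sketch of an index supporting that lookup, again assuming a hypothetical prices collection:

// Equality on productId with range conditions on the date fields can
// then be served from this index.
db.prices.createIndex({ productId: 1, dateStart: 1, dateEnd: 1 })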
P.S: I'd suggest using actual dates (ISODate) instead of strings, even if their format is sortable.