MongoDB: process change stream sequentially

I have an app that allows users to make donations.
I want to listen for insertions of donation records and save the total amount donated to a projection document.
Currently I do it like this:
Donation.watch([
  {
    $match: {
      operationType: 'insert',
    },
  },
]).on('change', async data => {
  const projection = await Projection.findById(someId);
  projection.total = projection.total + data.fullDocument.amount;
  await projection.save();
});
It kind of works, but there's an obvious problem: if two donations arrive at almost the same time, the second handler would still read the outdated version of the projection, so only one of the donations would effectively be accounted for.
Is it possible to somehow process the change stream sequentially, waiting for the processing of one record to finish before starting on the next one?
Do I need to hand-roll some sort of serial processing queue for that? Any tips on which direction to take here?
Edit: apparently it's possible to use $inc to build up a sum projection. But what if I needed to do something more complicated, e.g. projection.total = await someAsyncCall(projection.total, data.fullDocument.amount)?

For the initial question $inc works like a charm:
await Projection.findOneAndUpdate(
  { _id: someId },
  { $inc: { total: data.fullDocument.amount } }
);
For more complicated processing I'd need a serial queue, or to rely on MongoDB's versioning constraints.
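For illustration, here is a minimal sketch (not from the original answer) of such a serial queue: every change event is appended to a single promise chain, so one event finishes processing before the next one starts. someAsyncCall and someId are the hypothetical names from the question.

// Minimal serial queue: chain every handler onto the previous one so that
// processing of one change finishes before the next one starts.
let queue = Promise.resolve();

Donation.watch([{ $match: { operationType: 'insert' } }]).on('change', data => {
  queue = queue
    .then(async () => {
      const projection = await Projection.findById(someId);
      // someAsyncCall stands in for the question's hypothetical async computation.
      projection.total = await someAsyncCall(projection.total, data.fullDocument.amount);
      await projection.save();
    })
    .catch(err => {
      // Log and swallow errors so one failed event does not stall the queue.
      console.error('Failed to process change event', err);
    });
});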

Related

MapReduce function to return two outputs - MongoDB

I am currently doing some basic mapReduce in MongoDB.
I currently have data that looks like this:
db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...
I want to create a mapReduce function to group the data by gender and show:
the total number of each gender, male and female
the average weight of each gender
I can create a map/reduce function to carry out each task separately, I just can't get my head around how to show the output for both. I am guessing that since the grouping is based on gender, the map function should stay the same and I just need to alter something in the reduce section...
Work so far
var map1 = function () {
  var key = this.gender;
  emit(key, { count: 1 });
};

var reduce1 = function (key, values) {
  var sum = 0;
  values.forEach(function (value) { sum += value["count"]; });
  return { count: sum };
};

db.football_team.mapReduce(map1, reduce1, { out: "gender_stats" });
Output
db.gender_stats.find()
{"_id" : "f", "value" : {"count": 12} }
{"_id" : "m", "value" : {"count": 18} }
Thanks
The key rule of "map/reduce" in any implementation is basically that the same shape of data needs to be emitted by the mapper as is returned by the reducer. The key reason for this is part of how "map/reduce" conceptually works: the reducer may quite possibly be called multiple times. This basically means your reducer function can be called on output that was already emitted from a previous pass through the reducer, along with other data from the mapper.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
That said, your best approach to an "average" is therefore to total the data along with a count, and then simply divide the two. This actually adds another step to the "map/reduce" operation, as a finalize function.
db.football_team.mapReduce(
  // mapper
  function () {
    emit(this.gender, { count: 1, weight: this.weight });
  },
  // reducer
  function (key, values) {
    var output = { count: 0, weight: 0 };
    values.forEach(value => {
      output.count += value.count;
      output.weight += value.weight;
    });
    return output;
  },
  // options and finalize
  {
    "out": "gender_stats", // or { "inline": 1 } if you don't need another collection
    "finalize": function (key, value) {
      value.avg_weight = value.weight / value.count; // take an average
      delete value.weight;                           // optionally remove the unwanted key
      return value;
    }
  }
)
All fine because both mapper and reducer are emitting data with the same shape and also expecting input in that shape within the reducer itself. The finalize method of course is just invoked after all "reducing" is finally done and just processes each result.
As noted though, the aggregate() method actually does this far more effectively, in native code which does not incur the overhead ( and potential security risks ) of server-side JavaScript interpretation and execution:
db.football_team.aggregate([
  { "$group": {
    "_id": "$gender",
    "count": { "$sum": 1 },
    "avg_weight": { "$avg": "$weight" }
  }}
])
And that's basically it. Moreover you can actually continue and do other things after a $group pipeline stage ( or any stage for that matter ) in ways that you cannot do with a MongoDB mapReduce implementation. Notably something like applying a $sort to the results:
db.football_team.aggregate([
  { "$group": {
    "_id": "$gender",
    "count": { "$sum": 1 },
    "avg_weight": { "$avg": "$weight" }
  }},
  { "$sort": { "avg_weight": -1 } }
])
The only sorting mapReduce allows is that the key used with emit is always sorted in ascending order. You cannot sort the aggregated output in any other way without either performing queries against the output collection or working "in memory" with the results returned from the server.
As a "side note" ( though an important one ), you probably should also consider in "learning" that the reality is the "server-side JavaScript" functionality of MongoDB is really a work-around more than being a feature. When MongoDB was first introduced, it applied a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.
Thus to make up for the lack of the complete implementation of many query operators and aggregation functions which would come later, adding a JavaScript engine was a "quick fix" to allow certain things to be done with minimal implementation.
The result over the years is those JavaScript engine features are gradually being removed. The group() function of the API is removed. The eval() function of the API is deprecated and scheduled for removal at the next major version. The writing is basically "on the wall" for the limited future of these JavaScript on the server features, as the clear pattern is where the native features provide support for something, then the need to continue support for the JavaScript engine basically goes away.
The core wisdom here being that focusing on learning these JavaScript on the server features, is probably not really worth the time invested unless you have a pressing use case that currently cannot be solved by any other means.

How can I lock a document to read a value and then update it?

In my app users can subscribe to an event. There is a maximum number of participants allowed for each event. On booking an event I add the id of the user to an array within the event document. When the length of the array equals the max number of participants, the user should instead get added to a second array of stand-by participants.
My concern is that with many concurrent bookings the event could get overbooked: the write of one booking could happen after another booking has already read the current size of the array and added its user to the event.
As I'm a newbie, I'm unsure how to implement this with MongoDB. The best thing, I guess, would be if I could lock the document when the reading starts and unlock it when the writing has finished.
I searched for locking a document in MongoDB but found no examples. Perhaps I used the wrong terms.
I also found findAndModify but no hint whether this method would solve my problem.
...
const module = await Modules.findOne({ _id: args.moduleId });

if (!module.participantsIds ||
    module.participantsIds.length < module.maxParticipants) {
  updateAction = {
    $addToSet: { "participantsIds": userId }
  };
} else {
  updateAction = {
    $addToSet: {
      "standByParticipants": {
        "participantId": userId, "timeStamp": new Date()
      }
    }
  };
}

await Modules.update(
  // query
  { _id: args.moduleId },
  // update
  updateAction,
  // options
  { "multi": false, "upsert": false }
);
...
It would be great if someone could lead me in the right direction or give a hint on how to go about this problem.
Update:
The answer from krishna Prasad below clarified the use of findAndModify for me and seems to solve my problem.
Just out of curiosity, I'd appreciate it if someone could give an example with a lock on the document, if that is even possible.
Best
Markus
As per the official documentation:
When modifying a single document, both findAndModify and the update() method atomically update the document. See Atomicity and Transactions for more details about interactions and order of operations of these methods.
Link (the quote appears just before this anchor): https://docs.mongodb.com/manual/reference/command/findAndModify/#transactions
Also have a look at the Stack Overflow answer to "Is mongoDB's findAndModify 'transaction-save'".
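For illustration, here is a minimal sketch (not from the answer above) of how the booking itself can be made atomic with a single findOneAndUpdate, assuming MongoDB 3.6+ for the $expr query operator. The document is only matched while the participant list is still below maxParticipants, so the read of the array size and the write cannot interleave:

// Atomically add the user only if there is still room; the size check and
// the $addToSet happen in one document-level atomic operation.
const booked = await Modules.findOneAndUpdate(
  {
    _id: args.moduleId,
    $expr: {
      $lt: [
        { $size: { $ifNull: ["$participantsIds", []] } },
        "$maxParticipants"
      ]
    }
  },
  { $addToSet: { participantsIds: userId } },
  { new: true } // Mongoose option: return the updated document
);

if (!booked) {
  // The event was already full at the moment of the update,
  // so fall back to the stand-by list.
  await Modules.updateOne(
    { _id: args.moduleId },
    {
      $addToSet: {
        standByParticipants: { participantId: userId, timeStamp: new Date() }
      }
    }
  );
}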

Calculating collection stats for a subset of documents in MongoDB

I know the cardinal rule of SE is to not ask a question without giving examples of what you've already tried, but in this case I can't find where to begin. I've looked at the documentation for MongoDB and it looks like there are only two ways to calculate storage usage:
db.collection.stats() returns statistics about the entire collection. In my case I need to know the amount of storage being used by a subset of the data within a collection (the data for a particular user).
Object.bsonsize(<document>) returns the storage size of a single record, which would require a cursor function to calculate the size of each document, one at a time. My only concern with this approach is performance with large amounts of data. If a single user has tens of thousands of documents, this process could take too long.
Does anyone know of a way to calculate the aggregate document size of a set of records within a collection efficiently and accurately?
Thanks for the help.
This may not be the most efficient or accurate way to do it, but I ended up using a Mongoose plugin to get the size of the JSON representation of the document before it's saved:
const mongoose = require('mongoose');

module.exports = exports = function defaultPlugin(schema, options) {
  schema.add({
    userId: { type: mongoose.Schema.Types.ObjectId, ref: "User", required: true },
    recordSize: Number
  });

  // Store the size of the JSON representation on the document itself.
  schema.pre('save', function (next) {
    this.recordSize = JSON.stringify(this).length;
    next();
  });
};
This will convert the schema object to a JSON representation, get its length, then store the size in the document itself. I understand that this actually adds a tiny bit of extra storage to record the size, but it's the best I could come up with.
Then, to generate a storage report, I'm using a simple aggregate call to get the sum of all of the recordSize values in the collection, filtered by userId:
mongoose.model('YourCollectionName').aggregate([
  {
    $match: { userId: userId }
  },
  {
    $group: {
      _id: null,
      recordSize: { $sum: '$recordSize' },
      recordCount: { $sum: 1 }
    }
  }
], function (err, results) {
  // Do something with your results
});
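As a side note, on MongoDB 4.4+ there is an alternative that avoids storing a size field at all: the $bsonSize aggregation operator computes each document's actual BSON size server-side. A minimal sketch, where yourCollection is a hypothetical collection with the same userId field:

db.yourCollection.aggregate([
  { $match: { userId: userId } },
  { $group: {
      _id: null,
      totalSize: { $sum: { $bsonSize: "$$ROOT" } }, // actual BSON bytes per document
      recordCount: { $sum: 1 }
  }}
])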

Maintaining total count of a set in another collection

I have a simple scenario with two entities: posts and bumps (i.e. upvotes).
Example of a post:
{_id: 'happy_days', 'title': 'Happy days', text: '...', bumps: 2}
Example of a bump:
{_id: {user: 'jimmy', post: 'happy_days'}}
{_id: {user: 'hans', post: 'happy_days'}}
Question: how do I maintain a correct bumps count on the post under all circumstances (and failures)?
The method I have come up with so far is:
To bump: upsert and check for existence. Only if a document was inserted, increase the bumps count.
To unbump: delete and check for existence. Only if a document was deleted, decrease the bumps count.
The above fails if the app crashes between the two operations, and the only way to correct the bumps stats is to query all documents in the bump collection and recalculate everything offline (i.e. there is no way to know which posts have an incorrect bumps count).
I suggest that you stick with what you already have. The worst that can happen if there is a failover/connection issue between your two operations is that your bump count is wrong. So what? This is not the end of the world, and nobody is going to care too much whether a bump count is 812 or 813. You can always recreate the count anyway by checking how many bumps you have for each post with an aggregation query if something went wrong. Embrace eventual consistency!
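For illustration, a minimal sketch of that recovery query, assuming the bump collection is named bumps and uses the compound _id shape from the question:

// Recount bumps per post straight from the bump documents; the result can
// be compared against (or written back to) each post's bumps field.
db.bumps.aggregate([
  { $group: { _id: "$_id.post", bumps: { $sum: 1 } } }
])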
As an alternative to updating the data in multiple places (which will probably be best for read performance but, as you noticed, complicates updates), it may be worth considering storing the uids of the bumps in an array (here called bump_uids) directly on the post, and just counting the bumps when needed using the aggregation framework:
> db.test.aggregate([ { $match: { _id: 'happy_days' } },
    { $project: { bump_uids: 1 } },
    { $unwind: '$bump_uids' },
    { $group: { _id: '$_id', bumps: { $sum: 1 } } } ])
{ "result" : [ { "_id" : "happy_days", "bumps" : 3 } ], "ok" : 1 }
Since MongoDB does not yet support triggers ( https://jira.mongodb.org/browse/SERVER-124 ) you have to do this the gritty way with application logic.
As a brief example:
db.follower.insert({ fromId: u, toId: c });
db.user.update({ _id: u }, { $inc: { totalFollowing: 1 } });
db.user.update({ _id: c }, { $inc: { totalFollowers: 1 } });
Yes, it is not atomic, etc., however it is the way to do it. In reality many systems update counters like this, whether in MongoDB or not.
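As a side note, on MongoDB 4.0+ with a replica set, the same sequence can be wrapped in a multi-document transaction so that a crash between the operations cannot leave the counters inconsistent. A minimal sketch with the Node.js driver; client is an assumed connected MongoClient and 'mydb' a hypothetical database name:

const session = client.startSession();
try {
  // All three writes commit or abort together.
  await session.withTransaction(async () => {
    const db = client.db('mydb');
    await db.collection('follower').insertOne({ fromId: u, toId: c }, { session });
    await db.collection('user').updateOne({ _id: u }, { $inc: { totalFollowing: 1 } }, { session });
    await db.collection('user').updateOne({ _id: c }, { $inc: { totalFollowers: 1 } }, { session });
  });
} finally {
  await session.endSession();
}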

How do I do a 'not-in' operation in MongoDB?

I have two collections: shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, or shopping, or doing both, or doing neither on any given day. I now want to run a query: all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins, and in most cases it was suggested to denormalize. So one solution I could think of is to create a collection, activity, with an index on date, embedding the actions of the user. Something like:
{
  user_id,
  date,
  actions: [action_type, ...]
}
Insertion now becomes costly, as I will have to query before each insert.
A few suggestions:
Figure out all the queries you'll be running and all the types of data you will need to store. For example, do you expect to add activities in the future, or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only ever have these two activities: one record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When a new activity comes in, you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, and activity, then your inserts are much faster, but your queries now have to do a LOT of work querying across users, dates, and activities.
With the proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date":"somedate"}, {$inc:{"shopped":1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist", aka an "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who shopped more than once but didn't go to the beach at all:
db.coll.find({"date":"somedate","beached":0,"shopped":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for an explanation, and keep in mind that large documents will keep getting into your working data set; if they are huge and full of useless (old) data, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track semi-stable information like how many friends they have, and also a user_activity collection where you add a record per day per user for the activities they did. How much you normalize or denormalize your data is very tightly coupled to the types of queries you will run on it, which is why figuring out what those are was my first suggestion.
Insertion now becomes costly, as I will have to query before each insert.
Keep in mind that even with an RDBMS, insertion can be (relatively) costly when there are indices on the table (i.e., usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach, e.g.:
db.people.find({
  actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
  user_id,
  date,
  action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27); // June 27 (JS months are zero-based)
var end = new Date(2012, 6, 3);    // July 3
db.activities.find({
  date: { $gte: start, $lt: end },
  action: { $in: ["beach", "shopping"] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
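For illustration, that client-side step could look like this minimal sketch, where results is the array returned by the query above:

// Collect user ids per activity, then keep shoppers who never hit the beach.
var shoppers = new Set();
var beachGoers = new Set();
results.forEach(function (doc) {
  if (doc.action === "shopping") shoppers.add(doc.user_id);
  if (doc.action === "beach") beachGoers.add(doc.user_id);
});
var shoppedButNotBeached = [...shoppers].filter(function (id) {
  return !beachGoers.has(id);
});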
One possible structure is to use an embedded array of documents (a users collection):
{
  user_id: 1234,
  actions: [
    { action_type: "beach", date: "6/1/2012" },
    { action_type: "shopping", date: "6/2/2012" }
  ]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find({
  actions: {
    $elemMatch: {
      action_type: { $in: ["shopping"] },
      date: { $gt: start }
    }
  }
});
Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find({
  $and: [
    {
      actions: {
        $elemMatch: {
          action_type: { $in: ["shopping"] },
          date: { $gt: start }
        }
      }
    },
    {
      actions: {
        $not: {
          $elemMatch: {
            action_type: { $in: ["beach"] },
            date: { $gt: start }
          }
        }
      }
    }
  ]
});