MongoDB performance: $and vs. single object with multiple keys - mongodb

I have an internal service that does some operations on an order, it has a built-in required filter, and the service users can pass additional filter.
Two approaches to achieve the same thing:
A) Using $and:
async function getOrders ({ optionalFilter = {} }) {
const baseFilter = { amount: { $gt: 10 } };
const mergedFilter = { $and: [baseFilter, optionalFilter] };
return await Order.find(mergedFilter);
}
B) Merging all in the same object
async function getOrders ({ optionalFilter = {} }) {
const baseFilter = { amount: { $gt: 10 } };
const mergedFilter = { ...baseFilter, ...optionalFilter };
return await Order.find(mergedFilter);
}
I prefer the first approach because it allows me to do the following without overwriting the $gt: 10 rule, while the second would break the code by overwriting the internal rule:
getOrders({ optionalFilter: { amount: { $lt: 50 } } });
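To make the difference concrete, here is what each merge strategy produces for that call:
// A) $and keeps both conditions:
//    { $and: [ { amount: { $gt: 10 } }, { amount: { $lt: 50 } } ] }
// B) the object spread lets the caller's key overwrite the base filter:
//    { amount: { $lt: 50 } }   // the internal $gt: 10 rule is silently lost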
My question is, is there any advantage (performance or otherwise) of choosing one over the other?

My question is, is there any advantage (performance or otherwise) of choosing one over the other?
Short answer: No, as long as you are not using (compound) indexes.
Long answer: Okay, let's test it.
Take a collection of ~100k documents and an un-indexed field:
{asset_class: "COMMDTY"}
With $and: {$and: [{item_subclass: "Herb", asset_class: "COMMDTY"}]}
Without $and: {item_subclass: "Herb", asset_class: "COMMDTY"}
Okay, there is a 114ms difference between the query with $and and the one without. But does that mean you should stop using $and and avoid it? Of course not. Still, the more fields you add, the slower Mongo becomes. This whole picture changes, though, when I add an already indexed field to the query:
{expansion: "BFA", asset_class: "COMMDTY"}
So, if you add more fields via your mergedFilter, make sure that your indexes (if you are using them) can still be used, because for a compound index the field order is very meaningful. The query {asset_class: "COMMDTY", expansion: "BFA"} will still go through the existing index, since the order of keys in the filter document itself does not matter to the query planner.
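One way to verify which plan wins is explain(); a sketch, where the collection name items is only a placeholder:
// An IXSCAN stage in queryPlanner.winningPlan means the index is used;
// a COLLSCAN means the whole collection is scanned.
db.items.find({ asset_class: "COMMDTY", expansion: "BFA" }).explain("executionStats")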


MongoDB bulkWrite multiple updateOne vs updateMany

I have cases where I build bulkWrite operations in which some documents share the same update object. Is there any performance benefit to merging their filters and sending one updateMany with those filters, instead of multiple updateOne operations in the same bulkWrite?
It's obviously better to use updateMany over multiple updateOne calls when using the normal methods, but with bulkWrite, since it's a single command, are there any significant gains in preferring one over the other?
Example:
I have 200K documents that I need to update, and only 10 unique status values across all of them, so my options are:
Solutions:
A) Send one single bulkWrite with 10 updateMany operations, and each one of those operations will affect 20K documents.
B) Send one single bulkWrite with 200K updateOne operations, each holding its own filter and status.
As #AlexBlex noted, I have to look out for accidentally updating more than one document with the same filter. In my case I use _id as the filter, so accidentally updating other documents is not a concern, but it is definitely something to watch for when considering the updateMany option.
Thanks #AlexBlex.
Short answer:
Using updateMany is at least twice as fast, but it might accidentally update more documents than you intended; keep reading to learn how to avoid this and gain the performance benefits.
Long answer:
We ran the following experiment to find the answer; these are the steps:
Create a bankaccounts mongodb collection, each document contains only one field (balance).
Insert 1 million documents into the bankaccounts collection.
Randomize the order in memory of all 1 million documents to avoid any possible optimizations from the database using ids that are inserted in the same sequence, simulating a real-world scenario.
Build write operations for bulkWrite from the documents with a random number between 0 and 100.
Execute the bulkWrite.
Log the time the bulkWrite took.
Now, the experiment lies in the 4th step.
In one variation of the experiment we build an array of 1 million updateOne operations, each with a filter for a single document and its respective update object.
In the second variation, we build 100 updateMany operations, each with a filter covering 10K document ids and their respective update.
Results:
updateMany with multiple document ids is 243% faster than multiple updateOne operations. This cannot be used everywhere though; please read "The risk" section to learn when it should be used.
Details:
We ran the script 5 times for each variation, the detailed results are as follows:
With updateOne: 51.28 seconds on average.
With updateMany: 21.04 seconds on average.
The risk:
As many people have already pointed out, updateMany is not a direct substitute for updateOne, since it can incorrectly update multiple documents when the intention was to update only one.
This approach is only valid when you filter on a unique field such as _id; if the filter depends on fields that are not unique, multiple documents may be updated and the results will not be equivalent.
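As an illustration, here is a sketch of how the question's option A can be built while keeping every filter on the unique _id field (the model name Order and the shape of the changes input are assumptions, not part of the question):
// changes: [{ _id, status }, ...] - the desired status for each document.
function buildUpdateManyOperations (changes) {
  // Group the ids by the status value they should receive.
  const idsByStatus = new Map();
  for (const { _id, status } of changes) {
    if (!idsByStatus.has(status)) idsByStatus.set(status, []);
    idsByStatus.get(status).push(_id);
  }
  // One updateMany per distinct status: 10 operations instead of 200K.
  return [...idsByStatus.entries()].map(([status, ids]) => ({
    updateMany: {
      filter: { _id: { $in: ids } }, // unique ids, so no accidental extra matches
      update: { $set: { status } }
    }
  }));
}
// Usage: await Order.bulkWrite(buildUpdateManyOperations(changes));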
65831219.js
// 65831219.js
'use strict';
const mongoose = require('mongoose');
const { Schema } = mongoose;
const DOCUMENTS_COUNT = 1_000_000;
const UPDATE_MANY_OPERATIONS_COUNT = 100;
const MINIMUM_BALANCE = 0;
const MAXIMUM_BALANCE = 100;
const SAMPLES_COUNT = 10;
const bankAccountSchema = new Schema({
balance: { type: Number }
});
const BankAccount = mongoose.model('BankAccount', bankAccountSchema);
mainRunner().catch(console.error);
async function mainRunner () {
for (let i = 0; i < SAMPLES_COUNT; i++) {
await runOneCycle(buildUpdateManyWriteOperations).catch(console.error);
await runOneCycle(buildUpdateOneWriteOperations).catch(console.error);
console.log('-'.repeat(80));
}
process.exit(0);
}
/**
*
* @param {buildUpdateManyWriteOperations|buildUpdateOneWriteOperations} buildBulkWrite
*/
async function runOneCycle (buildBulkWrite) {
await mongoose.connect('mongodb://localhost:27017/test', {
useNewUrlParser: true,
useUnifiedTopology: true
});
await mongoose.connection.dropDatabase();
const { accounts } = await createAccounts({ accountsCount: DOCUMENTS_COUNT });
const { writeOperations } = buildBulkWrite({ accounts });
const writeStartedAt = Date.now();
await BankAccount.bulkWrite(writeOperations);
const writeEndedAt = Date.now();
console.log(`Write operations took ${(writeEndedAt - writeStartedAt) / 1000} seconds with \`${buildBulkWrite.name}\`.`);
}
async function createAccounts ({ accountsCount }) {
const rawAccounts = Array.from({ length: accountsCount }, () => ({ balance: getRandomInteger(MINIMUM_BALANCE, MAXIMUM_BALANCE) }));
const accounts = await BankAccount.insertMany(rawAccounts);
return { accounts };
}
function buildUpdateOneWriteOperations ({ accounts }) {
const writeOperations = shuffleArray(accounts).map((account) => ({
updateOne: {
filter: { _id: account._id },
update: { balance: getRandomInteger(MINIMUM_BALANCE, MAXIMUM_BALANCE) }
}
}));
return { writeOperations };
}
function buildUpdateManyWriteOperations ({ accounts }) {
shuffleArray(accounts);
const accountsChunks = chunkArray(accounts, accounts.length / UPDATE_MANY_OPERATIONS_COUNT);
const writeOperations = accountsChunks.map((accountsChunk) => ({
updateMany: {
filter: { _id: { $in: accountsChunk.map(account => account._id) } },
update: { balance: getRandomInteger(MINIMUM_BALANCE, MAXIMUM_BALANCE) }
}
}));
return { writeOperations };
}
function getRandomInteger (min = 0, max = 1) {
min = Math.ceil(min);
max = Math.floor(max);
return min + Math.floor(Math.random() * (max - min + 1));
}
function shuffleArray (array) {
let currentIndex = array.length;
let temporaryValue;
let randomIndex;
// While there remain elements to shuffle...
while (0 !== currentIndex) {
// Pick a remaining element...
randomIndex = Math.floor(Math.random() * currentIndex);
currentIndex -= 1;
// And swap it with the current element.
temporaryValue = array[currentIndex];
array[currentIndex] = array[randomIndex];
array[randomIndex] = temporaryValue;
}
return array;
}
function chunkArray (array, sizeOfTheChunkedArray) {
const chunked = [];
for (const element of array) {
const last = chunked[chunked.length - 1];
if (!last || last.length === sizeOfTheChunkedArray) {
chunked.push([element]);
} else {
last.push(element);
}
}
return chunked;
}
Output
$ node 65831219.js
Write operations took 20.803 seconds with `buildUpdateManyWriteOperations`.
Write operations took 50.84 seconds with `buildUpdateOneWriteOperations`.
----------------------------------------------------------------------------------------------------
Tests were run using MongoDB version 4.0.4.
At a high level, if you have the same update object, then you can use updateMany rather than bulkWrite.
Reason:
bulkWrite is designed to send multiple different commands to the server, as mentioned in the documentation.
If you have the same update object, updateMany is best suited.
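For instance, when every targeted document should receive the same update object, a single call is enough (the collection name, the ids variable, and the status value below are placeholders):
// One shared filter, one shared update object: no bulkWrite needed.
db.orders.updateMany(
  { _id: { $in: idsForThisStatus } },   // the documents that share this update
  { $set: { status: "SHIPPED" } }       // one shared update object
)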
Performance:
If you have 10K update commands in a bulkWrite, they are executed in batches internally, which may impact the execution time.
Exact lines from the reference about batching:
Each group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less. For example, if the bulk operations list consists of 2000 insert operations, MongoDB creates 2 groups, each with 1000 operations.
Thanks #Alex

MongoDB View vs Function to abstract query and variable/parameter passed

I hate to risk asking a duplicate question, but perhaps this is different from Passing Variables to a MongoDB View which didn't have any clear solution.
Below is a query to find the country for IP Address 16778237. (Outside the scope of this query, there is a formula that turns an IPV4 address into a number.)
I was wondering if we could abstract this query out of the NodeJS code and make a view, so the view could be called from NodeJS. But the fields ipFrom and ipTo are indexed so the query runs fast against millions of documents in the collection, so we can't just return all the rows to NodeJS and filter there.
In MSSQL maybe this would have to be a stored procedure, instead of a view. Just trying to learn what is possible in MongoDB. I know there are functions, which are written in JavaScript. Is that where I need to look?
db['ip2Locations'].aggregate(
{
$match:
{
$and: [
{
"ipFrom": {
$lte: 16778237
}
},
{
"ipTo": {
$gte: 16778237
}
},
{
"active": true
}
],
$comment: "where 16778237 between startIPRange and stopIPRange and the row is 'Active',sort by createdDateTime, limit to the top 1 row, and return the country"
}
},
{
$sort:
{
'createdDateTime': - 1
}
},
{
$project:
{
'countryCode': 1
}
},
{
$limit: 1
}
)
Part 2: after more research and experimenting, I found this is possible and runs successfully, but then see my attempt at making a view below this query.
var ipaddr = 16778237
db['ip2Locations'].aggregate(
{
$match:
{
$and: [
{
"ipFrom": {
$lte: ipaddr
}
},
{
"ipTo": {
$gte: ipaddr
}
},
{
"active": true
}
],
$comment: "where 16778237 between startIPRange and stopIPRange and the row is 'Active',sort by createdDateTime, limit to the top 1 row, and return the country"
}
},
{
$sort:
{
'createdDateTime': - 1
}
},
{
$project:
{
'countryCode': 1
}
},
{
$limit: 1
}
)
If I try to create a view with a "var" in it, like this;
db.createView("ip2Locations_vw-lookupcountryfromip","ip2Locations",[
var ipaddr = 16778237
db['ip2Locations'].aggregate(
I get error:
[Error] SyntaxError: expected expression, got keyword 'var'
In the link I provided above, I think the author was trying to figure out how the $$user-variables work (there is no example here: https://docs.mongodb.com/manual/reference/aggregation-variables/). That page refers to $let, but never shows how the two work together. I found one example here: https://www.tutorialspoint.com/mongodb-query-to-set-user-defined-variable-into-query on variables, but not $$variables. So I tried something like:
db.createView("ip2Locations_vw-lookupcountryfromip","ip2Locations",[
db['ip2Locations'].aggregate(
...etc...
"ipFrom": {
$lte: $$ipaddr
}
I tried ipaddr, $ipaddr, and $$ipaddr, and they all give a variation of this error:
[Error] ReferenceError: $ipaddr is not defined
In a perfect world, one would be able to do something like:
db['ip2Locations_vw-lookupcountryfromip'].find({ $let: { 'ipaddr': 16778237 } })
or similar.
I'm getting that it's possible with Javascript stored in MongoDB (How to use variables in MongoDB query?), but I'll have to re-read that; seems like some blogs were warning against it.
I have yet to find a working example using $$user-variables, still looking.
Interpretation
You want to query a view from some server side code, passing a variable to it.
Context
Can we use an external variable to recompute a View? Take the following pipeline:
var pipeline = [{ $group:{ _id:null, useless:{ $push:"$$NOW" } } }]
We can pass system variables using $$. We can define user variables too, but user-defined variables can only be built from:
Collection Data
System Variables.
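To make this concrete, a view built on the pipeline above re-evaluates the system variable on every read; a sketch, where the source collection name events is only a placeholder:
var pipeline = [{ $group: { _id: null, useless: { $push: "$$NOW" } } }];
db.createView("nowView", "events", pipeline);
// The pipeline runs again on every read, so "useless" reflects the time of
// the read, not the time the view was created.
db.nowView.find();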
Also, with respect to your Part 2:
A variable declared with var variable = "what" is computed only once, when the view is created. Redefining variable = "whatever" afterwards makes no difference to the view; it keeps using "what".
Conclusion
Views can only be re-computed with system variables, or with user variables that depend on those system variables or on collection data.
I added an answer to the post you linked to.
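Since a view cannot take a parameter, one workable alternative is to keep the parameterized pipeline on the application side, which also keeps the ipFrom/ipTo indexes usable. A sketch only, assuming the standard Node.js driver and the collection name from the question:
// Hypothetical helper: the same pipeline as the question, with the IP number
// supplied at call time instead of baked into a view.
async function lookupCountryFromIp (db, ipNumber) {
  const [row] = await db.collection('ip2Locations').aggregate([
    { $match: { ipFrom: { $lte: ipNumber }, ipTo: { $gte: ipNumber }, active: true } },
    { $sort: { createdDateTime: -1 } },
    { $project: { countryCode: 1 } },
    { $limit: 1 }
  ]).toArray();
  return row ? row.countryCode : null;
}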

Updating a value that is dependent on a newly updated document key

My goal is to add a comment to my CommentFeed, and while doing that I want to push that comment into my topComments field and also update numOfComments. I want to limit topComments to only 3 comments (how would I even set that up?). And how do I take the previous value of numOfComments and add one to it?
CommentFeed.findOneAndUpdate(
{ _id: commentId },
{
$push: {
comments: {
text: req.body.text
},
$push: topComments:{text: req.body.text}, <--- Limit this somehow to only allow an array length of 3?
$set: numOfComments: ? , <---What kind of logic is used here?
}
},
{ new: true }
)
CommentFeed Schema
const CommentFeedSchema = new Schema({
topComments:[{text:{type:String}}],
numOfComments:{type:Number},
comments: [
{ text: { type: String, required: true } }
]});
For the first issue (limiting the topComments array size) you can use the $slice operator. This has already been answered in other questions. But you might consider computing topComments from comments using the $slice operator in the projection argument:
CommentFeed.find( {}, { comments: { $slice: -3 } } )
For the second issue (updating a document using existing fields from that document), it is not something you can do in a simple findOneAndUpdate call. This was also discussed in other questions.
But you might consider computing numOfComments instead of updating it every time. You can do that with the $size operator of the aggregation framework:
CommentFeed.aggregate({$project: { numOfComments: { $size:"$comments" }}})
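If you do prefer to maintain both fields on write, here is a sketch of what the single update could look like ($each with $slice caps the array; the comment sub-document shape simply follows the question):
CommentFeed.findOneAndUpdate(
  { _id: commentId },
  {
    $push: {
      comments: { text: req.body.text },
      // keep only the 3 most recent entries in topComments
      topComments: { $each: [{ text: req.body.text }], $slice: -3 }
    },
    // add one to the previous value of numOfComments
    $inc: { numOfComments: 1 }
  },
  { new: true }
)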

mapreduce between consecutive documents

Setup:
I have a large collection with the following fields:
Name - String
Begin - time stamp
End - time stamp
Problem:
I want to get the gaps between consecutive documents, using the map-reduce paradigm.
Approach:
I'm trying to build a new collection of pairs, mid; after that I can compute the differences from it using $unwind and Pair[1].Begin - Pair[0].End.
function map(){
emit(0, this)
}
function reduce(key, values){
var i = 0;
var pairs = [];
while ( i < values.length -1){
pairs.push([values[i], values[i+1]]);
i = i + 1;
}
return {"pairs":pairs};
}
db.collection.mapReduce(map, reduce, { sort: { Begin: 1 }, out: { replace: "mid" } })
This works with a limited number of documents because of the 16MB document cap. I'm not sure if I need to load the collection into memory and do it there. How else can I approach this problem?
The mapReduce function of MongoDB has a different way of handling what you propose than the method you are using to solve it. The key factor here is "keeping" the "previous" document in order to make the comparison to the next.
The actual mechanism that supports this is the "scope" functionality, which allows a sort of "global" variable approach to use in the overall code. As you will see, what you are asking when that is considered takes no "reduction" at all as there is no "grouping", just emission of document "pair" data:
db.collection.mapReduce(
function() {
if ( last == null ) {
last = this;
} else {
emit(
{
"start_id": last._id,
"end_id": this._id
},
this.Begin - last.End
);
last = this;
}
},
function() {}, // no reduction required
{
"out": { "inline": 1 },
"scope": { "last": null }
}
)
Use a collection as the output instead of inline output if your result size requires it.
Either way, by using a "global" to keep the last document, the code is both simple and efficient.
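For example, a sketch of the same call writing its output to the mid collection from the question instead of returning it inline (the map function body is unchanged and elided here, and the sort option is added to match the ordering the question relies on):
db.collection.mapReduce(
  mapFunction,      // the same map function shown above, assigned to a variable
  function() {},    // still no reduction required
  {
    "out": { "replace": "mid" },   // write the gaps to the "mid" collection
    "scope": { "last": null },
    "sort": { "Begin": 1 }         // keep documents ordered so "last" is the previous one
  }
)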

Average Aggregation Queries in Meteor

Ok, still in my toy app, I want to find out the average mileage on a group of car owners' odometers. This is pretty easy on the client but doesn't scale. Right? But on the server, I don't exactly see how to accomplish it.
Questions:
How do you implement something on the server then use it on the client?
How do you use the $avg aggregation function of mongo to leverage its optimized aggregation function?
Or alternatively to (2) how do you do a map/reduce on the server and make it available to the client?
The suggestion by #HubertOG was to use Meteor.call, which makes sense and I did this:
# Client side
Template.mileage.average_miles = ->
answer = null
Meteor.call "average_mileage", (error, result) ->
console.log "got average mileage result #{result}"
answer = result
console.log "but wait, answer = #{answer}"
answer
# Server side
Meteor.methods average_mileage: ->
console.log "server mileage called"
total = count = 0
r = Mileage.find({}).forEach (mileage) ->
total += mileage.mileage
count += 1
console.log "server about to return #{total / count}"
total / count
That would seem to work fine, but it doesn't, because as near as I can tell Meteor.call is an asynchronous call, so answer will still be null when the helper returns. Handling stuff on the server seems like a common enough use case that I must have just overlooked something. What would that be?
Thanks!
As of Meteor 0.6.5, the collection API doesn't support aggregation queries yet because there's no (straightforward) way to do live updates on them. However, you can still write them yourself, and make them available in a Meteor.publish, although the result will be static. In my opinion, doing it this way is still preferable because you can merge multiple aggregations and use the client-side collection API.
Meteor.publish("someAggregation", function (args) {
var sub = this;
// This works for Meteor 0.6.5
var db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
// Your arguments to Mongo's aggregation. Make these however you want.
var pipeline = [
{ $match: doSomethingWith(args) },
{ $group: {
_id: whatWeAreGroupingWith(args),
count: { $sum: 1 }
}}
];
db.collection("server_collection_name").aggregate(
pipeline,
// Need to wrap the callback so it gets called in a Fiber.
Meteor.bindEnvironment(
function(err, result) {
// Add each of the results to the subscription.
_.each(result, function(e) {
// Generate a random disposable id for aggregated documents
sub.added("client_collection_name", Random.id(), {
key: e._id.somethingOfInterest,
count: e.count
});
});
sub.ready();
},
function(error) {
Meteor._debug( "Error doing aggregation: " + error);
}
)
);
});
The above is an example grouping/count aggregation. Some things of note:
When you do this, you'll naturally be doing an aggregation on server_collection_name and pushing the results to a different collection called client_collection_name.
This subscription isn't going to be live, and will probably be updated whenever the arguments change, so we use a really simple loop that just pushes all the results out.
The results of the aggregation don't have Mongo ObjectIDs, so we generate some arbitrary ones of our own.
The callback to the aggregation needs to be wrapped in a Fiber. I use Meteor.bindEnvironment here but one can also use a Future for more low-level control.
If you start combining the results of publications like these, you'll need to carefully consider how the randomly generated ids impact the merge box. However, a straightforward implementation of this is just a standard database query, except it is more convenient to use with Meteor APIs client-side.
TL;DR version: Almost anytime you are pushing data out from the server, a publish is preferable to a method.
For more information about different ways to do aggregation, check out this post.
I did this with the 'aggregate' method. (ver 0.7.x)
if(Meteor.isServer){
Future = Npm.require('fibers/future');
Meteor.methods({
'aggregate' : function(param){
var fut = new Future();
MongoInternals.defaultRemoteCollectionDriver().mongo._getCollection(param.collection).aggregate(param.pipe,function(err, result){
fut.return(result);
});
return fut.wait();
}
,'test':function(param){
var _param = {
pipe : [
{ $unwind:'$data' },
{ $match:{
'data.y':"2031",
'data.m':'01',
'data.d':'01'
}},
{ $project : {
'_id':0
,'project_id' : "$project_id"
,'idx' : "$data.idx"
,'y' : '$data.y'
,'m' : '$data.m'
,'d' : '$data.d'
}}
],
collection:"yourCollection"
}
Meteor.call('aggregate',_param);
}
});
}
If you want reactivity, use Meteor.publish instead of Meteor.call. There's an example in the docs where they publish the number of messages in a given room (just above the documentation for this.userId), you should be able to do something similar.
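A rough adaptation of that counts-by-room example to the average here might look like this; a sketch only, where the client-side collection name averages, the synthetic document id, and the handling of only the added case are assumptions for brevity:
Meteor.publish('averageMileage', function () {
  var self = this;
  var count = 0;
  var total = 0;
  var initializing = true;
  // Only "added" is handled in this sketch; "changed" and "removed"
  // would need similar handlers to stay accurate.
  var handle = Mileage.find({}).observeChanges({
    added: function (id, fields) {
      count += 1;
      total += fields.mileage;
      if (!initializing) {
        self.changed('averages', 'mileage', { value: total / count });
      }
    }
  });
  initializing = false;
  // Publish a single synthetic document holding the current average.
  self.added('averages', 'mileage', { value: count ? total / count : 0 });
  self.ready();
  self.onStop(function () { handle.stop(); });
});
// Client side (hypothetical): Averages = new Meteor.Collection('averages');
//                             Meteor.subscribe('averageMileage');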
You can use Meteor.methods for that.
// server
Meteor.methods({
average: function() {
...
return something;
},
});
// client
var _avg = { /* Create an object to store value and dependency */
dep: new Deps.Dependency()
};
Template.mileage.rendered = function() {
_avg.init = true;
};
Template.mileage.averageMiles = function() {
_avg.dep.depend(); /* Make the function rerun when _avg.dep is touched */
if(_avg.init) { /* Fetch the value from the server if not yet done */
_avg.init = false;
Meteor.call('average', function(error, result) {
_avg.val = result;
_avg.dep.changed(); /* Rerun the helper */
});
}
return _avg.val;
};