Calculate amount of value changes in MongoDB - mongodb

I have a collection like this, for example:
{ "_id": ObjectId("54575132a8269c77675ace49"),"power": false, "time": 1415008560000}
{ "_id": ObjectId("54575132a8269c77675ace50"),"power": true, "time": 1415008570000}
{ "_id": ObjectId("54575132a8269c77675ace51"),"power": false, "time": 1415008580000}
{ "_id": ObjectId("54575132a8269c77675ace52"),"power": false, "time": 1415008590000}
{ "_id": ObjectId("54575132a8269c77675ace53"),"power": true, "time": 1415008600000}
{ "_id": ObjectId("54575132a8269c77675ace54"),"power": false, "time": 1415008610000}
How can I count the number of times power changes from true to false or vice versa?
I could iterate through all the entries and increment a counter whenever the previous value differs from the current one, but how can I do this in MongoDB?
For this example, the result should be 4.

You could use the aggregation framework to do this:
db.yourCollection.aggregate({ $group:{ _id:"$power", count:{$sum:1} } })
which should give you the following result:
{_id:true,count:2}
{_id:false, count:4}
By subtracting the difference of those two values from the total document count (db.yourCollection.count()), you should have the number of state changes:
var cursor = db.yourCollection.aggregate({ $group:{ _id:"$power", count:{$sum:1} } });
var count = db.yourCollection.count();
var changes = count - Math.abs(cursor[0].count - cursor[1].count);
EDIT: Revised approach
@JohnnyHK's sharp eye spotted a problem with the approach above. All kudos, upvotes and the like go to him.
Calculating the number of changes
To calculate the changes under the given constraints, one could use map/reduce to count them, which should be pretty efficient even for very large collections.
var numberOfStateChanges = db.yourCollection.mapReduce(
// Mapping function
function(){
// Since in the sample data, there is no reasonable
// field for a key, we use an artificial one: 0
emit(0,this.power);
},
// Reduce function
function(key,values){
// The initial number of changes is 0
var changes=0;
// Our initial state, which does not count towards the changes,...
var state = values[0];
// ... hence we start to compare with the second item in the values array
for (var idx=1; idx < values.length; idx++){
// In case the current state is different from
// the one we are comparing with it, we have a state change
if(values[idx] != state) {
//... which we count...
changes +=1;
// ...and save.
state=values[idx];
}
}
return changes;
},
{
// We make sure the values are fed into the map function in the correct order
sort:{time:1},
// and return it directly instead of putting it into a collection, so we can process it
out:{inline:1}
}
).results[0].value
Now numberOfStateChanges holds the correct number of state changes.
Note
For this map/reduce to be processed efficiently, we need an index on the field we are sorting by, time:
db.yourCollection.ensureIndex({time:1})
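As a sanity check on the counting logic, the same comparison loop can be run outside MongoDB. The sketch below is plain JavaScript over the sample documents from the question (ObjectIds omitted); countStateChanges is a hypothetical helper name, not part of any MongoDB API.

```javascript
// Sample documents from the question, reduced to the two relevant fields.
var sampleDocs = [
  { power: false, time: 1415008560000 },
  { power: true,  time: 1415008570000 },
  { power: false, time: 1415008580000 },
  { power: false, time: 1415008590000 },
  { power: true,  time: 1415008600000 },
  { power: false, time: 1415008610000 }
];

// Hypothetical helper: counts how often `power` differs from the previous
// document once the documents are ordered by `time`.
function countStateChanges(documents) {
  var ordered = documents.slice().sort(function (a, b) {
    return a.time - b.time;
  });
  var changes = 0;
  for (var idx = 1; idx < ordered.length; idx++) {
    if (ordered[idx].power !== ordered[idx - 1].power) {
      changes += 1;
    }
  }
  return changes;
}

console.log(countStateChanges(sampleDocs)); // 4
```

The sort mirrors the sort:{time:1} option passed to mapReduce; without a defined order the count is not meaningful.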

Related

Update with upsert, but only update if date field of document in db is less than updated document

I am having a bit of an issue trying to come up with the logic for this. So, what I want to do is:
Bulk update a bunch of posts to my remote MongoDB instance BUT
If update, only update if lastModified field on the remote collection is less than lastModified field in the same document that I am about to update/insert
Basically, I want to update my list of documents if they have been modified since the last time I updated them.
I can think of two brute force ways to do it...
First, query my entire collection, manually remove and replace the documents that match the criteria, add the new ones, and then delete everything in the remote collection and mass insert everything back.
Second, query each item and then decide, if there is one in the remote collection, whether I want to update it or not. This seems like it would be very taxing when dealing with remote collections.
If relevant, I am working in a NodeJS environment, using the mongodb npm package for database operations.
You can use the bulkWrite API to carry out the updates with the logic you specified; it sends many write operations to the server in a single request.
For example, the following snippet shows how to go about this, assuming you already have the data from the web service that you need to update the remote collection with:
mongodb.connect(mongo_url, function(err, db) {
if(err) console.log(err);
else {
var mongo_remote_collection = db.collection("remote_collection_name");
/* data is from http call to an external service or ideally
place this within the service callback
*/
mongoUpsert(mongo_remote_collection, data, function() {
db.close();
})
}
})
function mongoUpsert(collection, data_array, cb) {
var ops = data_array.map(function(data) {
return {
"updateOne": {
"filter": {
"_id": data._id, // or any other filtering mechanism to identify a doc
"lastModified": { "$lt": data.lastModified }
},
"update": { "$set": data },
"upsert": true
}
};
});
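collection.bulkWrite(ops, function(err, r) {
// do something with result, then signal completion;
// calling cb here (not after bulkWrite) keeps db.close() from running before the write finishes
cb(err);
});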
}
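To see which documents the filter above actually selects, its logic can be mirrored in plain JavaScript. This is only a sketch: classifyUpsert is a hypothetical helper, and the "conflict" branch assumes _id is the unique key, in which case an upsert whose filter misses an existing _id attempts an insert and fails with a duplicate key error rather than silently skipping the document.

```javascript
// Hypothetical helper: mirrors the filter
//   { _id: incoming._id, lastModified: { $lt: incoming.lastModified } }
// combined with upsert: true. `remote` is the stored document with the
// same _id, or null when no such document exists.
function classifyUpsert(remote, incoming) {
  if (remote === null) {
    // Nothing matches, so the upsert inserts the incoming document.
    return "insert";
  }
  if (remote.lastModified < incoming.lastModified) {
    // The filter matches, so $set is applied to the stored document.
    return "update";
  }
  // The filter does not match, so the upsert tries to insert a second
  // document with the same _id, which is rejected as a duplicate key.
  return "conflict";
}

var incomingDoc = { _id: 1, lastModified: 200 };
console.log(classifyUpsert(null, incomingDoc));                          // insert
console.log(classifyUpsert({ _id: 1, lastModified: 100 }, incomingDoc)); // update
console.log(classifyUpsert({ _id: 1, lastModified: 300 }, incomingDoc)); // conflict
```

If such conflicts are expected, one option is to pass { ordered: false } as the options argument to bulkWrite so that the remaining operations still run, and to treat the reported duplicate key errors as "already up to date".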
If the data from the external service is large, consider sending the writes to the server in batches of 500. This performs better because you send one request per 500 operations rather than one request per operation.
For bulk operations MongoDB imposes a default internal limit of 1000 operations per batch (raised to 100,000 in MongoDB 3.6), so choosing 500 gives you control over the batch size rather than letting MongoDB split the operations for you, which matters once you have more than 1000 documents. In the first approach above the array is small, so it can be written in one go; batching by 500 is for larger arrays.
var ops = [],
counter = 0;
data_array.forEach(function(data) {
ops.push({
"updateOne": {
"filter": {
"_id": data._id,
"lastModified": { "$lt": data.lastModified }
},
"update": { "$set": data },
"upsert": true
}
});
counter++;
if (counter % 500 === 0) {
collection.bulkWrite(ops, function(err, r) {
// do something with result
});
ops = [];
}
})
// Flush whatever is left over after the last full batch
if (counter % 500 != 0) {
collection.bulkWrite(ops, function(err, r) {
// do something with result
});
}
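The batching arithmetic above can also be separated from the database calls. The following is a minimal sketch, assuming a hypothetical helper named toBatches; each returned batch would then be passed to collection.bulkWrite in turn.

```javascript
// Hypothetical helper: split an array of operations into batches
// of at most `size` elements, preserving order.
function toBatches(ops, size) {
  var batches = [];
  for (var i = 0; i < ops.length; i += size) {
    batches.push(ops.slice(i, i + size));
  }
  return batches;
}

// 1201 placeholder operations stand in for real updateOne documents.
var fakeOps = [];
for (var n = 0; n < 1201; n++) {
  fakeOps.push({ updateOne: { filter: { _id: n }, update: { $set: { n: n } }, upsert: true } });
}

var batchSizes = toBatches(fakeOps, 500).map(function (batch) {
  return batch.length;
});
console.log(batchSizes); // [ 500, 500, 201 ]
```

Building the batches first avoids the separate "flush the remainder" branch, at the cost of holding all operations in memory at once.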

MongoDB - Update text to Proper / title Case

We have a large collection of documents whose description text was entered in inconsistent case,
e.g.
Desc =
"THE CAT"
or
"The Dog"
or
"the cow"
We want to make them all consistent in title (or proper) case, where the first letter of each word is upper case and the rest are lower case:
"The Cat", "The Dog", "The Cow"
We are looking for help creating an update query that does this en masse, rather than manually as the data team does at present.
Thanks
The title-case algorithm below uses the Array.prototype.map() method and the String.prototype.replace() method, which returns a new string with some or all matches of a pattern replaced by a replacement.
In your case, the pattern passed to replace() is a string (the first character of the word), so it is treated as a verbatim string and only its first occurrence is replaced.
First off, you need to lowercase and split the string before applying the map() method. Once you define a function that implements the conversion, you then need to iterate your collection to apply an update with this function. Use the cursor.forEach() method on the cursor returned by find() to do the loop and within the loop you can then run an update on each document using the updateOne() method.
For relatively small datasets, the whole operation can be described by the following
function titleCase(str) {
return str.toLowerCase().split(' ').map(function(word) {
return word.replace(word[0], word[0].toUpperCase());
}).join(' ');
}
db.collection.find({}).forEach(function(doc){
db.collection.updateOne(
{ "_id": doc._id },
{ "$set": { "desc": titleCase(doc.desc) } }
);
});
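The conversion function can be checked on the question's sample values before running it against a collection. Note that word[0] is undefined for the empty strings produced by consecutive spaces, so the version above would throw on input such as "the  cow". The sketch below is a hypothetical variant, titleCaseSafe, that uses charAt() and slice() to tolerate that case; it is an illustration, not the answer's original function.

```javascript
// Hypothetical variant of titleCase that tolerates empty words
// (for example, those produced by two consecutive spaces).
function titleCaseSafe(str) {
  return str.toLowerCase().split(" ").map(function (word) {
    // charAt(0) is "" for an empty word, so nothing throws here.
    return word.charAt(0).toUpperCase() + word.slice(1);
  }).join(" ");
}

console.log(titleCaseSafe("THE CAT"));  // The Cat
console.log(titleCaseSafe("The Dog"));  // The Dog
console.log(titleCaseSafe("the cow"));  // The Cow
```

Only the space character is treated as a word separator here, matching the original function; tabs or hyphens would need extra handling.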
For improved performance, especially with huge datasets, use the Bulk() API to send the operations to the server in batches (for example, a batch size of 500). This performs much better because you send one request per 500 operations instead of one request per document.
The following demonstrates this approach. The first example uses the Bulk() API available in MongoDB versions >= 2.6 and < 3.2. It updates every document in the collection by converting the desc field to title case with the function above.
MongoDB versions >= 2.6 and < 3.2:
function titleCase(str) {
return str.toLowerCase().split(' ').map(function(word) {
return word.replace(word[0], word[0].toUpperCase());
}).join(' ');
}
var bulk = db.collection.initializeUnorderedBulkOp(),
counter = 0;
db.collection.find().forEach(function (doc) {
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "desc": titleCase(doc.desc) }
});
counter++;
if (counter % 500 === 0) {
// Execute per 500 operations
bulk.execute();
// re-initialize every 500 update statements
bulk = db.collection.initializeUnorderedBulkOp();
}
})
// Clean up remaining queue
if (counter % 500 !== 0) { bulk.execute(); }
The next example applies to MongoDB 3.2 and later, which deprecated the Bulk() API and provides a newer set of APIs built around bulkWrite().
MongoDB version 3.2 and greater:
var ops = [],
titleCase = function(str) {
return str.toLowerCase().split(' ').map(function(word) {
return word.replace(word[0], word[0].toUpperCase());
}).join(' ');
};
// Only select documents whose desc field exists and is a string ($type 2)
db.collection.find({
"desc": {
"$exists": true,
"$type": 2
}
}).forEach(function(doc) {
ops.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$set": { "desc": titleCase(doc.desc) }
}
}
});
if (ops.length === 500 ) {
db.collection.bulkWrite(ops);
ops = [];
}
})
// Write whatever is left over after the last full batch
if (ops.length > 0)
db.collection.bulkWrite(ops);

mapreduce between consecutive documents

Setup:
I have a large collection whose documents have the following fields:
Name - String
Begin - time stamp
End - time stamp
Problem:
I want to get the gaps between consecutive documents using the map-reduce paradigm.
Approach:
I'm trying to build a new collection of pairs, mid. After that I can compute the differences from it using $unwind and Pair[1].Begin - Pair[0].End
function map(){
emit(0, this);
}
function reduce(key, values){
var i = 0;
var pairs = [];
// Pair each document with its successor
while ( i < values.length - 1){
pairs.push([values[i], values[i+1]]);
i = i + 1;
}
return {"pairs":pairs};
}
db.collection.mapReduce(map, reduce, {sort:{Begin:1}, out:{replace:"mid"}})
This works only for a limited number of documents because of the 16MB document cap. I'm not sure whether I need to load the collection into memory and do it there. How else can I approach this problem?
MongoDB's mapReduce offers a different way of handling this than the method you are using. The key factor here is "keeping" the "previous" document in order to compare it with the next.
The mechanism that supports this is the "scope" option, which provides a sort of "global" variable for use throughout the code. With that in place, what you are asking for takes no "reduction" at all, because there is no "grouping", just emission of document "pair" data:
db.collection.mapReduce(
function() {
if ( last == null ) {
last = this;
} else {
emit(
{
"start_id": last._id,
"end_id": this._id
},
this.Begin - last.End
);
last = this;
}
},
function() {}, // no reduction required
{
"out": { "inline": 1 },
"scope": { "last": null }
}
)
Switch "out" to a collection if the results are too large to return inline.
Either way, using a "global" to keep the last document makes the code both simple and efficient.
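The "keep the previous document" idea can be illustrated without MongoDB. The sketch below is plain JavaScript over made-up sample documents (the _id, Begin and End values are hypothetical), with a local variable last playing the role of the scope variable; gapsBetween is a hypothetical helper name.

```javascript
// Hypothetical sample data: numeric time stamps for brevity.
var intervals = [
  { _id: "b", Begin: 15, End: 20 },
  { _id: "a", Begin: 0,  End: 10 },
  { _id: "c", Begin: 26, End: 30 }
];

// Hypothetical helper: emits one entry per consecutive pair of documents,
// ordered by Begin, holding the gap between them.
function gapsBetween(documents) {
  var ordered = documents.slice().sort(function (x, y) {
    return x.Begin - y.Begin;
  });
  var gaps = [];
  var last = null; // plays the role of the "last" scope variable
  ordered.forEach(function (doc) {
    if (last !== null) {
      gaps.push({ start_id: last._id, end_id: doc._id, gap: doc.Begin - last.End });
    }
    last = doc;
  });
  return gaps;
}

console.log(gapsBetween(intervals));
// [ { start_id: 'a', end_id: 'b', gap: 5 }, { start_id: 'b', end_id: 'c', gap: 6 } ]
```

The explicit sort matters: the comparison is only meaningful when documents are visited in Begin order.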

Scope work strangely in mapReduce of MongoDB for the purpose of producing cumulative frequency

I have a collection called user, and I want to get the cumulative frequency of the number of users by date, based on the _id field. The desired result should be something like this:
{
{_id: 2013-12-02, value: 10}, //upto 2013-12-02 there are 10 users
{_id: 2014-01-05, value: 20}, //upto 2014-01-05 there are totally 20 users
….
}
I try to get the above using the following mapReduce call:
db.user.mapReduce(
function(){var date = this._id.getTimestamp();
emit(new Date(date.getFullYear()+"-"+date.getMonth()+"-"+date.getDate()), 1)},
function(key, values) {cum = cum + Array.sum(values); return cum},
{out: "newUserAnalysis",
sort: {_id: 1},
scope: {cum: 0}})
But it seems that the cum variable is reset to zero after the first return statement encountered in the reduce function. Why? Is there any other way to get what I want?
Many thanks.
cum should not be reset, as it is a global variable shared by the map, reduce and finalize functions during the whole mapReduce run.
But the reduce function has three requirements that must be observed for processing to be correct, particularly with bulky data, because reduce may be called repeatedly, even for the same key. Normally the length of values passed to a single reduce call would not exceed 100. In short, your design can't guarantee that cum is updated in the sequence you expect, which produces incorrect statistics.
The following code is for your reference:
// map and reduce per day then save to a collection.
db.user.mapReduce(function() {
var date = this._id.getTimestamp();
emit(new Date(date.getFullYear() + "-" + (date.getMonth() + 1) + "-"
+ date.getDate()), 1);
}, function(key, values) {
return Array.sum(values);
}, {
out : "newUserAnalysis",
sort : {
_id : 1
}
});
// Do accumulation one by one.
var cursor = db.newUserAnalysis.find().sort({_id:1});
var newValue = 0, first = true;
while (cursor.hasNext()) {
var doc = cursor.next();
newValue += doc.value;
if (first) {
first = false;
} else {
db.newUserAnalysis.update({_id:doc._id}, {$set:{value:newValue}});
}
}
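The accumulation step is just a running total over the per-day counts, which can be checked in plain JavaScript. This is a sketch with made-up daily counts; toCumulative is a hypothetical helper, and the input is assumed to be sorted by _id already, as the cursor above is.

```javascript
// Hypothetical per-day counts, as the first mapReduce pass might produce them.
var dailyCounts = [
  { _id: "2013-12-02", value: 10 },
  { _id: "2013-12-20", value: 4 },
  { _id: "2014-01-05", value: 6 }
];

// Hypothetical helper: turns per-day counts (sorted by _id) into
// cumulative counts without modifying the input.
function toCumulative(counts) {
  var running = 0;
  return counts.map(function (doc) {
    running += doc.value;
    return { _id: doc._id, value: running };
  });
}

console.log(toCumulative(dailyCounts));
// [ { _id: '2013-12-02', value: 10 },
//   { _id: '2013-12-20', value: 14 },
//   { _id: '2014-01-05', value: 20 } ]
```

Doing the accumulation in a single ordered pass is what the reduce function cannot guarantee, since reduce calls are not made in key order.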

MongoDB map reduce producing different result to db.collection.find()

I have a map reduce like this:
map:
function() {
emit(this.username, {sent:this.sent, received:this.received});
}
reduce:
function(key, values) {
var result = {sent: 0, received: 0, entries:0};
values.forEach(function (value) {
result.sent += value.sent;
result.received += value.received;
result.entries += 1;
});
return result;
}
As you can see, I've been tracking the number of entries processed in the result. I've found that I get far fewer accessed records than I should.
For my particular data set, the output is like so:
[{u'_id': u'1743', u'value': {u'received': 1406545.0, u'sent': 26251138.0, u'entries': 316.0}}]
I'm running the map reduce with a query option that specifies a username and a date range.
If I perform the same query using db.collection.find() as follows, the count is different:
> db.entire_database.find({username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') }}).count()
1915
The full map reduce query is this:
db.entire_database.mapReduce(m, r, {out: 'myoutput', query: { username: '1743', time : { $lte: ISODate('2011-08-12 12:40:00'), $gte: ISODate('2011-08-12 08:40:00') } } })
So basically, I'm unsure why the counts are so radically different. Why does find() give me 1915, while the map reduce gives 316?
Your map function needs to emit an object with the same form as the one the reduce function returns (i.e. it should have an entries field set to 1). The MongoDB mapReduce documentation lists this as a requirement on the reduce function.
Basically, the values that are passed to the reduce function are not necessarily the raw outputs emitted from map. Rather than being called once, the reduce function is called many times on 'groups' of values produced by map, the results of which are then combined again by being passed into a further call of the reduce function. This is what makes MapReduce horizontally scalable, because any group of emitted values can be farmed out to any server in any order before being combined later.
So I would restructure your functions slightly like this:
map:
function() {
emit(this.username, {sent:this.sent, received:this.received, entries : 1});
}
reduce:
function(key, values) {
var result = {sent: 0, received: 0, entries:0};
values.forEach(function (value) {
result.sent += value.sent;
result.received += value.received;
result.entries += value.entries;
});
return result;
}
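The effect of re-reducing can be reproduced in plain JavaScript. The sketch below is an emulation only: runReduce is a hypothetical driver that reduces the emitted values in chunks and then reduces the partial results, and the chunk size of 4 is arbitrary, not what MongoDB uses.

```javascript
// The question's reduce: counts one entry per value it sees.
function reduceOriginal(key, values) {
  var result = { sent: 0, received: 0, entries: 0 };
  values.forEach(function (value) {
    result.sent += value.sent;
    result.received += value.received;
    result.entries += 1;
  });
  return result;
}

// The corrected reduce: sums the entries carried by each value.
function reduceFixed(key, values) {
  var result = { sent: 0, received: 0, entries: 0 };
  values.forEach(function (value) {
    result.sent += value.sent;
    result.received += value.received;
    result.entries += value.entries;
  });
  return result;
}

// Hypothetical driver: reduce in chunks, then reduce the partial results.
function runReduce(reduceFn, key, values, chunkSize) {
  var partials = [];
  for (var i = 0; i < values.length; i += chunkSize) {
    partials.push(reduceFn(key, values.slice(i, i + chunkSize)));
  }
  return partials.length === 1 ? partials[0] : reduceFn(key, partials);
}

// Ten emitted values, each already carrying entries: 1.
var emitted = [];
for (var k = 0; k < 10; k++) {
  emitted.push({ sent: 1, received: 2, entries: 1 });
}

console.log(runReduce(reduceOriginal, "1743", emitted, 4).entries); // 3
console.log(runReduce(reduceFixed, "1743", emitted, 4).entries);    // 10
```

The original reduce ends up counting the three partial results rather than the ten documents, which is the same kind of undercount as 316 versus 1915; sent and received come out right in both versions because they are summed rather than counted.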